linux-btrfs.vger.kernel.org archive mirror
* [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
@ 2021-09-08 10:50 fdmanana
  2021-09-09 19:21 ` Boris Burkov
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: fdmanana @ 2021-09-08 10:50 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do the IO on, we end up
in a deadlock. This is triggered by the new test case generic/647 from
fstests.
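
For reference, here is a minimal userspace sketch of the scenario (the
file path, the 4K size and the minimal error handling are illustrative
assumptions; the real reproducer is the mmap-rw-fault program driven by
generic/647):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t len = 4096;  /* assumed to satisfy DIO alignment */
      char *buf;
      int fd;

      fd = open("/mnt/test/file", O_CREAT | O_RDWR | O_DIRECT, 0644);
      if (fd < 0)
          return 1;
      if (ftruncate(fd, len) < 0)
          return 1;

      /* Map the very file range we are about to do direct IO on. */
      buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (buf == MAP_FAILED)
          return 1;

      /*
       * Direct IO read into the file's own mapping: faulting in 'buf'
       * re-enters the btrfs read path on the extent range that
       * btrfs_dio_iomap_begin() already locked.
       */
      pread(fd, buf, len, 0);

      /*
       * Direct IO write from the mapping: the fault-in ends up waiting
       * on the ordered extent that this same write created.
       */
      pwrite(fd, buf, len, 0);

      munmap(buf, len);
      close(fd);
      return 0;
  }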

For a direct IO read we get a trace like this:

[  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
[  967.875992] Call Trace:
[  967.875999]  __schedule+0x3ca/0xe10
[  967.876015]  schedule+0x43/0xe0
[  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
[  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
[  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
[  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
[  967.876253]  ? lru_cache_add+0x104/0x220
[  967.876255]  ? kvm_sched_clock_read+0x14/0x40
[  967.876258]  ? sched_clock_cpu+0xd/0x110
[  967.876263]  ? lock_release+0x155/0x4a0
[  967.876271]  read_pages+0x86/0x270
[  967.876274]  ? lru_cache_add+0x125/0x220
[  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
[  967.876291]  filemap_fault+0x626/0xa20
[  967.876303]  __do_fault+0x36/0xf0
[  967.876308]  __handle_mm_fault+0x83f/0x15f0
[  967.876322]  handle_mm_fault+0x9e/0x260
[  967.876327]  __get_user_pages+0x204/0x620
[  967.876332]  ? get_user_pages_unlocked+0x69/0x340
[  967.876340]  get_user_pages_unlocked+0xd3/0x340
[  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
[  967.876366]  iov_iter_get_pages+0x8d/0x3a0
[  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
[  967.876379]  ? lock_release+0x155/0x4a0
[  967.876387]  iomap_dio_bio_actor+0x232/0x410
[  967.876396]  iomap_apply+0x12a/0x4a0
[  967.876398]  ? iomap_dio_rw+0x30/0x30
[  967.876414]  __iomap_dio_rw+0x29f/0x5e0
[  967.876415]  ? iomap_dio_rw+0x30/0x30
[  967.876420]  ? lock_acquired+0xf3/0x420
[  967.876429]  iomap_dio_rw+0xa/0x30
[  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
[  967.876460]  new_sync_read+0x118/0x1a0
[  967.876472]  vfs_read+0x128/0x1b0
[  967.876477]  __x64_sys_pread64+0x90/0xc0
[  967.876483]  do_syscall_64+0x3b/0xc0
[  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  967.876490] RIP: 0033:0x7fb6f2c038d6
[  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000

This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then, after
iomap has called the btrfs_dio_iomap_begin() callback, it triggers the
page faults that result in reading the pages, through the readahead
callback btrfs_readahead(), and there we end up attempting to lock the
same extent range again (or a subrange of what we locked before),
resulting in the deadlock.
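
Condensed from the trace above, the read-side lock dependency is (a
sketch showing only the relevant frames):

  btrfs_dio_iomap_begin()                locks the extent range
  iomap_dio_bio_actor()
    bio_iov_iter_get_pages()             faults in the mmap'ed buffer
      filemap_fault() -> btrfs_readahead()
        btrfs_lock_and_flush_ordered_range()
          lock_extent_bits()             waits on the same range -> deadlock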

For a direct IO write, the scenario is a bit different, and it results
in a trace like this:

[ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
[ 1330.351906] Call Trace:
[ 1330.351913]  __schedule+0x3ca/0xe10
[ 1330.351930]  schedule+0x43/0xe0
[ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
[ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
[ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
[ 1330.352133]  ? lru_cache_add+0x104/0x220
[ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
[ 1330.352138]  ? sched_clock_cpu+0xd/0x110
[ 1330.352143]  ? lock_release+0x155/0x4a0
[ 1330.352151]  read_pages+0x86/0x270
[ 1330.352155]  ? lru_cache_add+0x125/0x220
[ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
[ 1330.352172]  filemap_fault+0x626/0xa20
[ 1330.352176]  ? filemap_map_pages+0x18b/0x660
[ 1330.352184]  __do_fault+0x36/0xf0
[ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
[ 1330.352203]  handle_mm_fault+0x9e/0x260
[ 1330.352208]  __get_user_pages+0x204/0x620
[ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
[ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
[ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
[ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
[ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
[ 1330.352259]  ? lock_release+0x155/0x4a0
[ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
[ 1330.352275]  iomap_apply+0x12a/0x4a0
[ 1330.352278]  ? iomap_dio_rw+0x30/0x30
[ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
[ 1330.352294]  ? iomap_dio_rw+0x30/0x30
[ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
[ 1330.352339]  new_sync_write+0x11f/0x1b0
[ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[ 1330.352354]  vfs_write+0x292/0x3c0
[ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
[ 1330.352365]  do_syscall_64+0x3b/0xc0
[ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1330.352372] RIP: 0033:0x7f4b0a580986
[ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up at btrfs_lock_and_flush_ordered_range(),
where we find the ordered extent for our write, created by the iomap
callback btrfs_dio_iomap_begin(), and we wait for it to complete, which
makes us deadlock since we can't complete the ordered extent without
reading the pages (the iomap code only submits the bio after the pages
are faulted in).
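
Condensed from the trace above, the write-side dependency is (a sketch
showing only the relevant frames):

  btrfs_dio_iomap_begin()                creates the ordered extent
  iomap_dio_bio_actor()
    bio_iov_iter_get_pages()             faults in the mmap'ed buffer
      filemap_fault() -> btrfs_readahead()
        btrfs_lock_and_flush_ordered_range()
          btrfs_start_ordered_extent()   waits on our own ordered extent,
                                         whose bio was never submitted
                                         -> deadlock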

Fix this by setting the nofault attribute of the given iov_iter and
retrying the direct IO read/write if we get an -EFAULT error back from
iomap. For reads, also disable page faults completely; this is because
when we read from a hole or a prealloc extent, we can still trigger page
faults due to the call to iov_iter_zero() done by iomap - at the moment,
it is oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
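
Condensed, the pattern applied to both paths looks like the following
sketch, distilled from the hunks below (the write path additionally
guards against making no progress, and the DSYNC handling is omitted;
reads use fault_in_iov_iter_writeable() and also call
pagefault_disable()):

  again:
      iter->nofault = true;
      ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
                         IOMAP_DIO_PARTIAL, done);
      iter->nofault = false;

      if (ret > 0)
          done = ret;

      if (iov_iter_count(iter) > 0 && (ret == -EFAULT || ret > 0)) {
          /* Fault in the remaining pages and retry. */
          fault_in_iov_iter_readable(iter, iov_iter_count(iter));
          goto again;
      }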

This depends on the iov_iter and iomap changes done by a recent patchset
from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
moment of this writing. The cover letter has the following subject:

   "[PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks"

The thread can be found at:

https://lore.kernel.org/all/20210827164926.1726765-1-agruenba@redhat.com/

Fixing these issues could be done without the iov_iter and iomap changes
introduced in that patchset; however, it would be much more complex due
to the need to reorder some operations for writes and to pass some state
through nested and deep call chains, which would be particularly
cumbersome for reads - for example, making the readahead and endio
handlers for page reads aware that we are in a direct IO read context
and know which inode and extent range we locked before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

As noted in the changelog, this currently depends on an unmerged patchset
that changes the iov_iter and iomap code. Unfortunately without that
patchset merged, the solution for this bug would be much more complex
and hairy.

 fs/btrfs/file.c | 128 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 112 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9d41b28c67ba..a020fa5b077c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1904,16 +1904,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
+	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	loff_t pos;
 	ssize_t written = 0;
 	ssize_t written_buffered;
+	size_t prev_left = 0;
 	loff_t endbyte;
 	ssize_t err;
 	unsigned int ilock_flags = 0;
-	struct iomap_dio *dio = NULL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -1956,23 +1957,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
-	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0, 0);
+	/*
+	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
+	 * calls generic_write_sync() (through iomap_dio_complete()), because
+	 * that results in calling fsync (btrfs_sync_file()) which will try to
+	 * lock the inode in exclusive/write mode.
+	 */
+	if (is_sync_write)
+		iocb->ki_flags &= ~IOCB_DSYNC;
 
-	btrfs_inode_unlock(inode, ilock_flags);
+	/*
+	 * The iov_iter can be mapped to the same file range we are writing to.
+	 * If that's the case, then we will deadlock in the iomap code, because
+	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
+	 * an ordered extent, and after that it will fault in the pages that the
+	 * iov_iter refers to. During the fault-in we end up in the readahead
+	 * pages code (starting at btrfs_readahead()), which will lock the range,
+	 * find that ordered extent and then wait for it to complete (at
+	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
+	 * obviously the ordered extent can never complete, as we haven't yet
+	 * submitted the respective bio(s). This always happens when the buffer is
+	 * memory mapped to the same file range, since the iomap DIO code always
+	 * invalidates pages in the target file range (after starting and waiting
+	 * for any writeback).
+	 *
+	 * So here we disable page faults in the iov_iter and then retry if we
+	 * got -EFAULT, faulting in the pages before the retry.
+	 */
+again:
+	from->nofault = true;
+	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   IOMAP_DIO_PARTIAL, written);
+	from->nofault = false;
 
-	if (IS_ERR_OR_NULL(dio)) {
-		err = PTR_ERR_OR_ZERO(dio);
-		if (err < 0 && err != -ENOTBLK)
-			goto out;
-	} else {
-		written = iomap_dio_complete(dio);
+	if (err > 0)
+		written = err;
+
+	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
+		const size_t left = iov_iter_count(from);
+		/*
+		 * We have more data left to write. Try to fault in as many as
+		 * possible of the remainder pages and retry. We do this without
+		 * releasing and locking again the inode, to prevent races with
+		 * truncate.
+		 *
+		 * Also, in case the iov refers to pages in the file range of the
+		 * file we want to write to (due to a mmap), we could enter an
+		 * infinite loop if we retry after faulting the pages in, since
+		 * iomap will invalidate any pages in the range early on, before
+		 * it tries to fault in the pages of the iov. So we keep track of
+		 * how much was left of iov in the previous EFAULT and fall back
+		 * to buffered IO in case we haven't made any progress.
+		 */
+		if (left == prev_left) {
+			err = -ENOTBLK;
+		} else {
+			fault_in_iov_iter_readable(from, left);
+			prev_left = left;
+			goto again;
+		}
 	}
 
-	if (written < 0 || !iov_iter_count(from)) {
-		err = written;
+	btrfs_inode_unlock(inode, ilock_flags);
+
+	/*
+	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
+	 * the fsync (call generic_write_sync()).
+	 */
+	if (is_sync_write)
+		iocb->ki_flags |= IOCB_DSYNC;
+
+	/* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
+	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
-	}
 
 buffered:
 	pos = iocb->ki_pos;
@@ -1997,7 +2054,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
 				 endbyte >> PAGE_SHIFT);
 out:
-	return written ? written : err;
+	return err < 0 ? err : written;
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -3649,6 +3706,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
 static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t read = 0;
 	ssize_t ret;
 
 	if (fsverity_active(inode))
@@ -3658,10 +3716,48 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
+again:
+	/*
+	 * This is similar to what we do for direct IO writes, see the comment
+	 * at btrfs_direct_write(), but we also disable page faults in addition
+	 * to disabling them only at the iov_iter level. This is because when
+	 * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
+	 * which can still trigger page fault-ins despite having set ->nofault
+	 * of our 'to' iov_iter to true.
+	 *
+	 * The difference to direct IO writes is that we deadlock when trying
+	 * to lock the extent range in the inode's tree during the page reads
+	 * triggered by the fault-in (while for writes it is due to waiting for
+	 * our own ordered extent). This is because for direct IO reads,
+	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
+	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
+	 */
+	pagefault_disable();
+	to->nofault = true;
 	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			   0, 0);
+			   IOMAP_DIO_PARTIAL, read);
+	to->nofault = false;
+	pagefault_enable();
+
+	if (ret > 0)
+		read = ret;
+
+	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+		/*
+		 * We have more data left to read. Try to fault in as many as
+		 * possible of the remainder pages and retry.
+		 *
+		 * Unlike for direct IO writes, in case the iov refers to the
+		 * file and range we are reading from (due to a mmap), we don't
+		 * need to worry about an infinite loop (see btrfs_direct_write())
+		 * because iomap does not invalidate pages for reads, only does
+		 * it for writes.
+		 */
+		fault_in_iov_iter_writeable(to, iov_iter_count(to));
+		goto again;
+	}
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
-	return ret;
+	return ret < 0 ? ret : read;
 }
 
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-- 
2.33.0



* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-09-08 10:50 [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes fdmanana
@ 2021-09-09 19:21 ` Boris Burkov
  2021-09-10  8:41   ` Filipe Manana
  2021-10-22  5:59 ` Wang Yugui
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-09-09 19:21 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Sep 08, 2021 at 11:50:34AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we do a direct IO read or write when the buffer given by the user is
> memory mapped to the file range we are going to do the IO on, we end up
> in a deadlock. This is triggered by the new test case generic/647 from
> fstests.
> 
> For a direct IO read we get a trace like this:
> 
> [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> [  967.875992] Call Trace:
> [  967.875999]  __schedule+0x3ca/0xe10
> [  967.876015]  schedule+0x43/0xe0
> [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> [  967.876253]  ? lru_cache_add+0x104/0x220
> [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> [  967.876258]  ? sched_clock_cpu+0xd/0x110
> [  967.876263]  ? lock_release+0x155/0x4a0
> [  967.876271]  read_pages+0x86/0x270
> [  967.876274]  ? lru_cache_add+0x125/0x220
> [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> [  967.876291]  filemap_fault+0x626/0xa20
> [  967.876303]  __do_fault+0x36/0xf0
> [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> [  967.876322]  handle_mm_fault+0x9e/0x260
> [  967.876327]  __get_user_pages+0x204/0x620
> [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> [  967.876379]  ? lock_release+0x155/0x4a0
> [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> [  967.876396]  iomap_apply+0x12a/0x4a0
> [  967.876398]  ? iomap_dio_rw+0x30/0x30
> [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> [  967.876415]  ? iomap_dio_rw+0x30/0x30
> [  967.876420]  ? lock_acquired+0xf3/0x420
> [  967.876429]  iomap_dio_rw+0xa/0x30
> [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> [  967.876460]  new_sync_read+0x118/0x1a0
> [  967.876472]  vfs_read+0x128/0x1b0
> [  967.876477]  __x64_sys_pread64+0x90/0xc0
> [  967.876483]  do_syscall_64+0x3b/0xc0
> [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  967.876490] RIP: 0033:0x7fb6f2c038d6
> [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
> 
> This happens because at btrfs_dio_iomap_begin() we lock the extent range
> and return with it locked - we only unlock in the endio callback, at
> end_bio_extent_readpage() -> endio_readpage_release_extent(). Then, after
> iomap has called the btrfs_dio_iomap_begin() callback, it triggers the
> page faults that result in reading the pages, through the readahead
> callback btrfs_readahead(), and there we end up attempting to lock the
> same extent range again (or a subrange of what we locked before),
> resulting in the deadlock.
> 
> For a direct IO write, the scenario is a bit different, and it results
> in a trace like this:
> 
> [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> [ 1330.351906] Call Trace:
> [ 1330.351913]  __schedule+0x3ca/0xe10
> [ 1330.351930]  schedule+0x43/0xe0
> [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> [ 1330.352133]  ? lru_cache_add+0x104/0x220
> [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> [ 1330.352143]  ? lock_release+0x155/0x4a0
> [ 1330.352151]  read_pages+0x86/0x270
> [ 1330.352155]  ? lru_cache_add+0x125/0x220
> [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> [ 1330.352172]  filemap_fault+0x626/0xa20
> [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> [ 1330.352184]  __do_fault+0x36/0xf0
> [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> [ 1330.352203]  handle_mm_fault+0x9e/0x260
> [ 1330.352208]  __get_user_pages+0x204/0x620
> [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> [ 1330.352259]  ? lock_release+0x155/0x4a0
> [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> [ 1330.352275]  iomap_apply+0x12a/0x4a0
> [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> [ 1330.352339]  new_sync_write+0x11f/0x1b0
> [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> [ 1330.352354]  vfs_write+0x292/0x3c0
> [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> [ 1330.352365]  do_syscall_64+0x3b/0xc0
> [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 1330.352372] RIP: 0033:0x7f4b0a580986
> [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> range unlocked, but later when the page faults are triggered and we try
> to read the extents, we end up at btrfs_lock_and_flush_ordered_range(),
> where we find the ordered extent for our write, created by the iomap
> callback btrfs_dio_iomap_begin(), and we wait for it to complete, which
> makes us deadlock since we can't complete the ordered extent without
> reading the pages (the iomap code only submits the bio after the pages
> are faulted in).
> 
> Fix this by setting the nofault attribute of the given iov_iter and
> retrying the direct IO read/write if we get an -EFAULT error back from
> iomap. For reads, also disable page faults completely; this is because
> when we read from a hole or a prealloc extent, we can still trigger page
> faults due to the call to iov_iter_zero() done by iomap - at the moment,
> it is oblivious to the value of the ->nofault attribute of an iov_iter.
> We also need to keep track of the number of bytes written or read, and
> pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
> 
> This depends on the iov_iter and iomap changes done by a recent patchset
> from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> moment of this writing. The cover letter has the following subject:
> 
>    "[PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks"
> 
> The thread can be found at:
> 
> https://lore.kernel.org/all/20210827164926.1726765-1-agruenba@redhat.com/
> 
> Fixing these issues could be done without the iov_iter and iomap changes
> introduced in that patchset; however, it would be much more complex due
> to the need to reorder some operations for writes and to pass some state
> through nested and deep call chains, which would be particularly
> cumbersome for reads - for example, making the readahead and endio
> handlers for page reads aware that we are in a direct IO read context
> and know which inode and extent range we locked before.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> As noted in the changelog, this currently depends on an unmerged patchset
> that changes the iov_iter and iomap code. Unfortunately without that
> patchset merged, the solution for this bug would be much more complex
> and hairy.
> 
>  fs/btrfs/file.c | 128 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 112 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9d41b28c67ba..a020fa5b077c 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1904,16 +1904,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
>  
>  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  {
> +	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	loff_t pos;
>  	ssize_t written = 0;
>  	ssize_t written_buffered;
> +	size_t prev_left = 0;
>  	loff_t endbyte;
>  	ssize_t err;
>  	unsigned int ilock_flags = 0;
> -	struct iomap_dio *dio = NULL;
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		ilock_flags |= BTRFS_ILOCK_TRY;
> @@ -1956,23 +1957,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  		goto buffered;
>  	}
>  
> -	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			     0, 0);
> +	/*
> +	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> +	 * calls generic_write_sync() (through iomap_dio_complete()), because
> +	 * that results in calling fsync (btrfs_sync_file()) which will try to
> +	 * lock the inode in exclusive/write mode.
> +	 */
> +	if (is_sync_write)
> +		iocb->ki_flags &= ~IOCB_DSYNC;
>  
> -	btrfs_inode_unlock(inode, ilock_flags);
> +	/*
> +	 * The iov_iter can be mapped to the same file range we are writing to.
> +	 * If that's the case, then we will deadlock in the iomap code, because
> +	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
> +	 * an ordered extent, and after that it will fault in the pages that the
> +	 * iov_iter refers to. During the fault-in we end up in the readahead
> +	 * pages code (starting at btrfs_readahead()), which will lock the range,
> +	 * find that ordered extent and then wait for it to complete (at
> +	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> +	 * obviously the ordered extent can never complete, as we haven't yet
> +	 * submitted the respective bio(s). This always happens when the buffer is
> +	 * memory mapped to the same file range, since the iomap DIO code always
> +	 * invalidates pages in the target file range (after starting and waiting
> +	 * for any writeback).

I'm misunderstanding something about this part of the comment. Sorry for
the dumb question:

If the invalidate always triggers the issue, why does it work the second
time after you manually fault them in and try iomap_dio_rw again? I tried
the patches out and traced the calls, and it did indeed work this way
(EFAULT on the first call, OK on the second) but I guess I just don't get
why the invalidate predictably causes the problem only once. I guess the
way you fault it in manually must differ in some crucial way from the
mmap the user does?

Otherwise, this looks like a nice fix and worked as advertised on my
setup. (and deadlocked without the fix)

> +	 *
> +	 * So here we disable page faults in the iov_iter and then retry if we
> +	 * got -EFAULT, faulting in the pages before the retry.
> +	 */
> +again:
> +	from->nofault = true;
> +	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +			   IOMAP_DIO_PARTIAL, written);
> +	from->nofault = false;
>  
> -	if (IS_ERR_OR_NULL(dio)) {
> -		err = PTR_ERR_OR_ZERO(dio);
> -		if (err < 0 && err != -ENOTBLK)
> -			goto out;
> -	} else {
> -		written = iomap_dio_complete(dio);
> +	if (err > 0)
> +		written = err;
> +
> +	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
> +		const size_t left = iov_iter_count(from);
> +		/*
> +		 * We have more data left to write. Try to fault in as many as
> +		 * possible of the remainder pages and retry. We do this without
> +		 * releasing and locking again the inode, to prevent races with
> +		 * truncate.
> +		 *
> +		 * Also, in case the iov refers to pages in the file range of the
> +		 * file we want to write to (due to a mmap), we could enter an
> +		 * infinite loop if we retry after faulting the pages in, since
> +		 * iomap will invalidate any pages in the range early on, before
> +		 * it tries to fault in the pages of the iov. So we keep track of
> +		 * how much was left of iov in the previous EFAULT and fall back
> +		 * to buffered IO in case we haven't made any progress.
> +		 */
> +		if (left == prev_left) {
> +			err = -ENOTBLK;
> +		} else {
> +			fault_in_iov_iter_readable(from, left);
> +			prev_left = left;
> +			goto again;
> +		}
>  	}
>  
> -	if (written < 0 || !iov_iter_count(from)) {
> -		err = written;
> +	btrfs_inode_unlock(inode, ilock_flags);
> +
> +	/*
> +	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
> +	 * the fsync (call generic_write_sync()).
> +	 */
> +	if (is_sync_write)
> +		iocb->ki_flags |= IOCB_DSYNC;
> +
> +	/* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
> +	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
>  		goto out;
> -	}
>  
>  buffered:
>  	pos = iocb->ki_pos;
> @@ -1997,7 +2054,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
>  				 endbyte >> PAGE_SHIFT);
>  out:
> -	return written ? written : err;
> +	return err < 0 ? err : written;
>  }
>  
>  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> @@ -3649,6 +3706,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
>  static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	ssize_t read = 0;
>  	ssize_t ret;
>  
>  	if (fsverity_active(inode))
> @@ -3658,10 +3716,48 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  		return 0;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> +again:
> +	/*
> +	 * This is similar to what we do for direct IO writes, see the comment
> +	 * at btrfs_direct_write(), but we also disable page faults in addition
> +	 * to disabling them only at the iov_iter level. This is because when
> +	 * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> +	 * which can still trigger page fault-ins despite having set ->nofault
> +	 * of our 'to' iov_iter to true.
> +	 *
> +	 * The difference to direct IO writes is that we deadlock when trying
> +	 * to lock the extent range in the inode's tree during the page reads
> +	 * triggered by the fault-in (while for writes it is due to waiting for
> +	 * our own ordered extent). This is because for direct IO reads,
> +	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
> +	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
> +	 */
> +	pagefault_disable();
> +	to->nofault = true;
>  	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			   0, 0);
> +			   IOMAP_DIO_PARTIAL, read);
> +	to->nofault = false;
> +	pagefault_enable();
> +
> +	if (ret > 0)
> +		read = ret;
> +
> +	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> +		/*
> +		 * We have more data left to read. Try to fault in as many as
> +		 * possible of the remainder pages and retry.
> +		 *
> +		 * Unlike for direct IO writes, in case the iov refers to the
> +		 * file and range we are reading from (due to a mmap), we don't
> +		 * need to worry about an infinite loop (see btrfs_direct_write())
> +		 * because iomap does not invalidate pages for reads, only does
> +		 * it for writes.
> +		 */
> +		fault_in_iov_iter_writeable(to, iov_iter_count(to));
> +		goto again;
> +	}
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> -	return ret;
> +	return ret < 0 ? ret : read;
>  }
>  
>  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> -- 
> 2.33.0
> 


* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-09-09 19:21 ` Boris Burkov
@ 2021-09-10  8:41   ` Filipe Manana
  2021-09-10 16:44     ` Boris Burkov
  0 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2021-09-10  8:41 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On Thu, Sep 9, 2021 at 8:21 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Sep 08, 2021 at 11:50:34AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If we do a direct IO read or write when the buffer given by the user is
> > memory mapped to the file range we are going to do the IO on, we end up
> > in a deadlock. This is triggered by the new test case generic/647 from
> > fstests.
> >
> > For a direct IO read we get a trace like this:
> >
> > [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> > [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> > [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> > [  967.875992] Call Trace:
> > [  967.875999]  __schedule+0x3ca/0xe10
> > [  967.876015]  schedule+0x43/0xe0
> > [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> > [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> > [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> > [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> > [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> > [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> > [  967.876253]  ? lru_cache_add+0x104/0x220
> > [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> > [  967.876258]  ? sched_clock_cpu+0xd/0x110
> > [  967.876263]  ? lock_release+0x155/0x4a0
> > [  967.876271]  read_pages+0x86/0x270
> > [  967.876274]  ? lru_cache_add+0x125/0x220
> > [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> > [  967.876291]  filemap_fault+0x626/0xa20
> > [  967.876303]  __do_fault+0x36/0xf0
> > [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> > [  967.876322]  handle_mm_fault+0x9e/0x260
> > [  967.876327]  __get_user_pages+0x204/0x620
> > [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> > [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> > [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> > [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> > [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> > [  967.876379]  ? lock_release+0x155/0x4a0
> > [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> > [  967.876396]  iomap_apply+0x12a/0x4a0
> > [  967.876398]  ? iomap_dio_rw+0x30/0x30
> > [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> > [  967.876415]  ? iomap_dio_rw+0x30/0x30
> > [  967.876420]  ? lock_acquired+0xf3/0x420
> > [  967.876429]  iomap_dio_rw+0xa/0x30
> > [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> > [  967.876460]  new_sync_read+0x118/0x1a0
> > [  967.876472]  vfs_read+0x128/0x1b0
> > [  967.876477]  __x64_sys_pread64+0x90/0xc0
> > [  967.876483]  do_syscall_64+0x3b/0xc0
> > [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  967.876490] RIP: 0033:0x7fb6f2c038d6
> > [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> > [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> > [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> > [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> > [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> > [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
> >
> > This happens because at btrfs_dio_iomap_begin() we lock the extent range
> > and return with it locked - we only unlock in the endio callback, at
> > end_bio_extent_readpage() -> endio_readpage_release_extent(). Then, after
> > iomap has called the btrfs_dio_iomap_begin() callback, it triggers the
> > page faults that result in reading the pages, through the readahead
> > callback btrfs_readahead(), and there we end up attempting to lock the
> > same extent range again (or a subrange of what we locked before),
> > resulting in the deadlock.
> >
> > For a direct IO write, the scenario is a bit different, and it results
> > in a trace like this:
> >
> > [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> > [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> > [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> > [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> > [ 1330.351906] Call Trace:
> > [ 1330.351913]  __schedule+0x3ca/0xe10
> > [ 1330.351930]  schedule+0x43/0xe0
> > [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> > [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> > [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> > [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> > [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> > [ 1330.352133]  ? lru_cache_add+0x104/0x220
> > [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> > [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> > [ 1330.352143]  ? lock_release+0x155/0x4a0
> > [ 1330.352151]  read_pages+0x86/0x270
> > [ 1330.352155]  ? lru_cache_add+0x125/0x220
> > [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> > [ 1330.352172]  filemap_fault+0x626/0xa20
> > [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> > [ 1330.352184]  __do_fault+0x36/0xf0
> > [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> > [ 1330.352203]  handle_mm_fault+0x9e/0x260
> > [ 1330.352208]  __get_user_pages+0x204/0x620
> > [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> > [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> > [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> > [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> > [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> > [ 1330.352259]  ? lock_release+0x155/0x4a0
> > [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> > [ 1330.352275]  iomap_apply+0x12a/0x4a0
> > [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> > [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> > [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> > [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> > [ 1330.352339]  new_sync_write+0x11f/0x1b0
> > [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> > [ 1330.352354]  vfs_write+0x292/0x3c0
> > [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> > [ 1330.352365]  do_syscall_64+0x3b/0xc0
> > [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [ 1330.352372] RIP: 0033:0x7f4b0a580986
> > [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> > [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> > [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> > [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> > [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> > [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >
> > Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> > range unlocked, but later when the page faults are triggered and we try
> > to read the extents, we end up at btrfs_lock_and_flush_ordered_range(),
> > where we find the ordered extent for our write, created by the iomap
> > callback btrfs_dio_iomap_begin(), and we wait for it to complete, which
> > makes us deadlock since we can't complete the ordered extent without
> > reading the pages (the iomap code only submits the bio after the pages
> > are faulted in).
> >
> > Fix this by setting the nofault attribute of the given iov_iter and
> > retrying the direct IO read/write if we get an -EFAULT error back from
> > iomap. For reads, also disable page faults completely; this is because
> > when we read from a hole or a prealloc extent, we can still trigger page
> > faults due to the call to iov_iter_zero() done by iomap - at the moment,
> > it is oblivious to the value of the ->nofault attribute of an iov_iter.
> > We also need to keep track of the number of bytes written or read, and
> > pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
> >
> > This depends on the iov_iter and iomap changes done by a recent patchset
> > from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> > moment of this writing. The cover letter has the following subject:
> >
> >    "[PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks"
> >
> > The thread can be found at:
> >
> > https://lore.kernel.org/all/20210827164926.1726765-1-agruenba@redhat.com/
> >
> > Fixing these issues could be done without the iov_iter and iomap changes
> > introduced in that patchset; however, it would be much more complex due
> > to the need to reorder some operations for writes and to pass some state
> > through nested and deep call chains, which would be particularly
> > cumbersome for reads - for example, making the readahead and endio
> > handlers for page reads aware that we are in a direct IO read context
> > and know which inode and extent range we locked before.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >
> > As noted in the changelog, this currently depends on an unmerged patchset
> > that changes the iov_iter and iomap code. Unfortunately without that
> > patchset merged, the solution for this bug would be much more complex
> > and hairy.
> >
> >  fs/btrfs/file.c | 128 ++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 112 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 9d41b28c67ba..a020fa5b077c 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1904,16 +1904,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
> >
> >  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >  {
> > +     const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
> >       struct file *file = iocb->ki_filp;
> >       struct inode *inode = file_inode(file);
> >       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >       loff_t pos;
> >       ssize_t written = 0;
> >       ssize_t written_buffered;
> > +     size_t prev_left = 0;
> >       loff_t endbyte;
> >       ssize_t err;
> >       unsigned int ilock_flags = 0;
> > -     struct iomap_dio *dio = NULL;
> >
> >       if (iocb->ki_flags & IOCB_NOWAIT)
> >               ilock_flags |= BTRFS_ILOCK_TRY;
> > @@ -1956,23 +1957,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >               goto buffered;
> >       }
> >
> > -     dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                          0, 0);
> > +     /*
> > +      * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> > +      * calls generic_write_sync() (through iomap_dio_complete()), because
> > +      * that results in calling fsync (btrfs_sync_file()) which will try to
> > +      * lock the inode in exclusive/write mode.
> > +      */
> > +     if (is_sync_write)
> > +             iocb->ki_flags &= ~IOCB_DSYNC;
> >
> > -     btrfs_inode_unlock(inode, ilock_flags);
> > +     /*
> > +      * The iov_iter can be mapped to the same file range we are writing to.
> > +      * If that's the case, then we will deadlock in the iomap code, because
> > +      * it first calls our callback btrfs_dio_iomap_begin(), which will create
> > +      * an ordered extent, and after that it will fault in the pages that the
> > +      * iov_iter refers to. During the fault in we end up in the readahead
> > +      * pages code (starting at btrfs_readahead()), which will lock the range,
> > +      * find that ordered extent and then wait for it to complete (at
> > +      * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> > +      * obviously the ordered extent can never complete, as we haven't yet
> > +      * submitted the respective bio(s). This always happens when the buffer is
> > +      * memory mapped to the same file range, since the iomap DIO code always
> > +      * invalidates pages in the target file range (after starting and waiting
> > +      * for any writeback).
>
> I'm misunderstanding something about this part of the comment. Sorry for
> the dumb question:

It's not dumb at all, it's a very good observation.

>
> If the invalidate always triggers the issue, why does it work the second
> time after you manually fault them in and try iomap_dio_rw again? I tried
> the patches out and traced the calls, and it did indeed work this way
> (EFAULT on the first call, OK on the second) but I guess I just don't get
> why the invalidate predictably causes the problem only once. I guess the
> way you fault it in manually must differ in some crucial way from the
> mmap the user does?

So with generic/647, I also observed what you experienced - the first
retry seems to always succeed.
This is correlated to the fact that the test does a 4K write only.

Try with a 1M write for example, or larger, and you're likely to get
into an "infinite" loop like I did.
I say "infinite" because for the 1M case in my test vm it's not really
infinite, but it takes over 60 seconds to complete.
But in theory it can be infinite - that is excessive, and falling back
to a buffered write when we don't make any progress is much faster.

So a retry often actually works because the page invalidation done by
the iomap code fails to release pages.
This is because:

1) When faulting in pages, we go to btrfs' readahead and page read
code, where we get locked pages, then lock the extent ranges, and then
submit the bio(s);

2) When the bio completes, the end io callback for page reads is run -
end_bio_extent_readpage() - this runs in a separate task/workqueue;

3) There we unlock a page and after that we unlock the extent range;

4) As soon as the page is unlocked, the task that faulted in a page is
woken up and resumes doing its stuff - in this case it's the dio write
task;

5) So the dio task calls iomap, which in turn attempts to invalidate
the pages in the range - this ends up calling btrfs' page release
callback (btrfs_releasepage()).
    Through this call chain we end up calling
try_release_extent_state(), which makes btrfs_releasepage() return 0
(can't release page) if the extent range is currently still locked -
the task running end_bio_extent_readpage() has not yet unlocked the
extent range (but has already unlocked the page).

So that's why the invalidation sometimes is not able to release pages,
and the retries work the very first time.
If we ever change end_bio_extent_readpage() to unlock the extent range
before unlocking a page, then the page invalidation/release should
always work, resulting in the infinite loop described above.
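
Condensing the above into a timeline (a sketch of the interleaving
described in steps 1-5; only the relevant calls are shown):

  dio write task                         read endio task
  --------------                         ---------------
  fault_in_iov_iter_readable()
    locks pages + extent range,
    submits read bio, waits on page
                                         end_bio_extent_readpage():
                                           unlocks page  <- wakes writer
  retries iomap_dio_rw():
    invalidates the page range
      btrfs_releasepage()
        try_release_extent_state()
          range still locked -> returns 0
    page survives, the buffer stays
    resident, so the retry succeeds
                                           unlocks extent range  <- too late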

So it's all about this specific timing. With large writes, covering a
mmap'ed range with many pages, I run into those long loops that seem
infinite - and they might be for much larger writes - certainly it's
not desirable at all to have a 1M write take 60 seconds, for example.

>
> Otherwise, this looks like a nice fix and worked as advertised on my
> setup. (and deadlocked without the fix)

Thanks for testing it and reading it.

>
> > +      *
> > +      * So here we disable page faults in the iov_iter and then retry if we
> > +      * got -EFAULT, faulting in the pages before the retry.
> > +      */
> > +again:
> > +     from->nofault = true;
> > +     err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > +                        IOMAP_DIO_PARTIAL, written);
> > +     from->nofault = false;
> >
> > -     if (IS_ERR_OR_NULL(dio)) {
> > -             err = PTR_ERR_OR_ZERO(dio);
> > -             if (err < 0 && err != -ENOTBLK)
> > -                     goto out;
> > -     } else {
> > -             written = iomap_dio_complete(dio);
> > +     if (err > 0)
> > +             written = err;
> > +
> > +     if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
> > +             const size_t left = iov_iter_count(from);
> > +             /*
> > +              * We have more data left to write. Try to fault in as many as
> > +              * possible of the remainder pages and retry. We do this without
> > +              * releasing and locking again the inode, to prevent races with
> > +              * truncate.
> > +              *
> > +              * Also, in case the iov refers to pages in the file range of the
> > +              * file we want to write to (due to a mmap), we could enter an
> > +              * infinite loop if we retry after faulting the pages in, since
> > +              * iomap will invalidate any pages in the range early on, before
> > +              * it tries to fault in the pages of the iov. So we keep track of
> > +              * how much was left of iov in the previous EFAULT and fall back
> > +              * to buffered IO in case we haven't made any progress.
> > +              */
> > +             if (left == prev_left) {
> > +                     err = -ENOTBLK;
> > +             } else {
> > +                     fault_in_iov_iter_readable(from, left);
> > +                     prev_left = left;
> > +                     goto again;
> > +             }
> >       }
> >
> > -     if (written < 0 || !iov_iter_count(from)) {
> > -             err = written;
> > +     btrfs_inode_unlock(inode, ilock_flags);
> > +
> > +     /*
> > +      * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
> > +      * the fsync (call generic_write_sync()).
> > +      */
> > +     if (is_sync_write)
> > +             iocb->ki_flags |= IOCB_DSYNC;
> > +
> > +     /* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
> > +     if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
> >               goto out;
> > -     }
> >
> >  buffered:
> >       pos = iocb->ki_pos;
> > @@ -1997,7 +2054,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >       invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> >                                endbyte >> PAGE_SHIFT);
> >  out:
> > -     return written ? written : err;
> > +     return err < 0 ? err : written;
> >  }
> >
> >  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > @@ -3649,6 +3706,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
> >  static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> >  {
> >       struct inode *inode = file_inode(iocb->ki_filp);
> > +     ssize_t read = 0;
> >       ssize_t ret;
> >
> >       if (fsverity_active(inode))
> > @@ -3658,10 +3716,48 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> >               return 0;
> >
> >       btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > +again:
> > +     /*
> > +      * This is similar to what we do for direct IO writes, see the comment
> > +      * at btrfs_direct_write(), but we also disable page faults in addition
> > +      * to disabling them only at the iov_iter level. This is because when
> > +      * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> > +      * which can still trigger page fault-ins despite having set ->nofault
> > +      * of our 'to' iov_iter to true.
> > +      *
> > +      * The difference to direct IO writes is that we deadlock when trying
> > +      * to lock the extent range in the inode's tree during the page reads
> > +      * triggered by the fault-in (while for writes it is due to waiting for
> > +      * our own ordered extent). This is because for direct IO reads,
> > +      * btrfs_dio_iomap_begin() returns with the extent range locked, which
> > +      * is only unlocked in the endio callback (end_bio_extent_readpage()).
> > +      */
> > +     pagefault_disable();
> > +     to->nofault = true;
> >       ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                        0, 0);
> > +                        IOMAP_DIO_PARTIAL, read);
> > +     to->nofault = false;
> > +     pagefault_enable();
> > +
> > +     if (ret > 0)
> > +             read = ret;
> > +
> > +     if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> > +             /*
> > +              * We have more data left to read. Try to fault in as many as
> > +              * possible of the remainder pages and retry.
> > +              *
> > +              * Unlike for direct IO writes, in case the iov refers to the
> > +              * file and range we are reading from (due to a mmap), we don't
> > +              * need to worry about an infinite loop (see btrfs_direct_write())
> > +              * because iomap does not invalidate pages for reads, only does
> > +              * it for writes.
> > +              */
> > +             fault_in_iov_iter_writeable(to, iov_iter_count(to));
> > +             goto again;
> > +     }
> >       btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > -     return ret;
> > +     return ret < 0 ? ret : read;
> >  }
> >
> >  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > --
> > 2.33.0
> >


* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-09-10  8:41   ` Filipe Manana
@ 2021-09-10 16:44     ` Boris Burkov
  0 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-09-10 16:44 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Sep 10, 2021 at 09:41:33AM +0100, Filipe Manana wrote:
> On Thu, Sep 9, 2021 at 8:21 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Wed, Sep 08, 2021 at 11:50:34AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > If we do a direct IO read or write when the buffer given by the user is
> > > memory mapped to the file range we are going to do the IO on, we end up
> > > in a deadlock. This is triggered by the new test case generic/647 from
> > > fstests.
> > >
> > > For a direct IO read we get a trace like this:
> > >
> > > [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> > > [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> > > [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> > > [  967.875992] Call Trace:
> > > [  967.875999]  __schedule+0x3ca/0xe10
> > > [  967.876015]  schedule+0x43/0xe0
> > > [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> > > [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> > > [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> > > [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> > > [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> > > [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> > > [  967.876253]  ? lru_cache_add+0x104/0x220
> > > [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> > > [  967.876258]  ? sched_clock_cpu+0xd/0x110
> > > [  967.876263]  ? lock_release+0x155/0x4a0
> > > [  967.876271]  read_pages+0x86/0x270
> > > [  967.876274]  ? lru_cache_add+0x125/0x220
> > > [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> > > [  967.876291]  filemap_fault+0x626/0xa20
> > > [  967.876303]  __do_fault+0x36/0xf0
> > > [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> > > [  967.876322]  handle_mm_fault+0x9e/0x260
> > > [  967.876327]  __get_user_pages+0x204/0x620
> > > [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> > > [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> > > [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> > > [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> > > [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> > > [  967.876379]  ? lock_release+0x155/0x4a0
> > > [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> > > [  967.876396]  iomap_apply+0x12a/0x4a0
> > > [  967.876398]  ? iomap_dio_rw+0x30/0x30
> > > [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> > > [  967.876415]  ? iomap_dio_rw+0x30/0x30
> > > [  967.876420]  ? lock_acquired+0xf3/0x420
> > > [  967.876429]  iomap_dio_rw+0xa/0x30
> > > [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> > > [  967.876460]  new_sync_read+0x118/0x1a0
> > > [  967.876472]  vfs_read+0x128/0x1b0
> > > [  967.876477]  __x64_sys_pread64+0x90/0xc0
> > > [  967.876483]  do_syscall_64+0x3b/0xc0
> > > [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [  967.876490] RIP: 0033:0x7fb6f2c038d6
> > > [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> > > [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> > > [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> > > [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> > > [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> > > [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
> > >
> > > This happens because at btrfs_dio_iomap_begin() we lock the extent range
> > > and return with it locked - we only unlock in the endio callback, at
> > > end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
> > > iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
> > > faults that result in reading the pages, through the readahead callback
> > > btrfs_readahead(), and through there we end up attempting to lock again the
> > > same extent range (or a subrange of what we locked before), resulting in
> > > the deadlock.
> > >
> > > For a direct IO write, the scenario is a bit different, and it results in
> > > a trace like this:
> > >
> > > [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> > > [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> > > [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> > > [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> > > [ 1330.351906] Call Trace:
> > > [ 1330.351913]  __schedule+0x3ca/0xe10
> > > [ 1330.351930]  schedule+0x43/0xe0
> > > [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> > > [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> > > [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> > > [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> > > [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> > > [ 1330.352133]  ? lru_cache_add+0x104/0x220
> > > [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> > > [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> > > [ 1330.352143]  ? lock_release+0x155/0x4a0
> > > [ 1330.352151]  read_pages+0x86/0x270
> > > [ 1330.352155]  ? lru_cache_add+0x125/0x220
> > > [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> > > [ 1330.352172]  filemap_fault+0x626/0xa20
> > > [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> > > [ 1330.352184]  __do_fault+0x36/0xf0
> > > [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> > > [ 1330.352203]  handle_mm_fault+0x9e/0x260
> > > [ 1330.352208]  __get_user_pages+0x204/0x620
> > > [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> > > [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> > > [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> > > [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> > > [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> > > [ 1330.352259]  ? lock_release+0x155/0x4a0
> > > [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> > > [ 1330.352275]  iomap_apply+0x12a/0x4a0
> > > [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> > > [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> > > [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> > > [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> > > [ 1330.352339]  new_sync_write+0x11f/0x1b0
> > > [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> > > [ 1330.352354]  vfs_write+0x292/0x3c0
> > > [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> > > [ 1330.352365]  do_syscall_64+0x3b/0xc0
> > > [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 1330.352372] RIP: 0033:0x7f4b0a580986
> > > [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> > > [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> > > [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> > > [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> > > [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> > > [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > >
> > > Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> > > range unlocked, but later when the page faults are triggered and we try
> > > to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where
> > > we find the ordered extent for our write, created by the iomap callback
> > > btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
> > > deadlock since we can't complete the ordered extent without reading the
> > > pages (the iomap code only submits the bio after the pages are faulted
> > > in).
> > >
> > > Fix this by setting the nofault attribute of the given iov_iter and retrying
> > > the direct IO read/write if we get an -EFAULT error returned from iomap.
> > > For reads, also disable page faults completely; this is because when we
> > > read from a hole or a prealloc extent, we can still trigger page faults
> > > due to the call to iov_iter_zero() done by iomap - at the moment, it is
> > > oblivious to the value of the ->nofault attribute of an iov_iter.
> > > We also need to keep track of the number of bytes written or read, and
> > > pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
> > >
> > > This depends on the iov_iter and iomap changes done by a recent patchset
> > > from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> > > moment of this writing. The cover letter has the following subject:
> > >
> > >    "[PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks"
> > >
> > > The thread can be found at:
> > >
> > > https://lore.kernel.org/all/20210827164926.1726765-1-agruenba@redhat.com/
> > >
> > > Fixing these issues could be done without the iov_iter and iomap changes
> > > introduced in that patchset; however, it would be much more complex due to
> > > the need to reorder some operations for writes and to pass some state
> > > through nested and deep call chains, which would be particularly cumbersome
> > > for reads - for example, making the readahead and the endio handlers for
> > > page reads aware that we are in a direct IO read context and which inode
> > > and extent range we locked before.
> > >
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> > > ---
> > >
> > > As noted in the changelog, this currently depends on an unmerged patchset
> > > that changes the iov_iter and iomap code. Unfortunately without that
> > > patchset merged, the solution for this bug would be much more complex
> > > and hairy.
> > >
> > >  fs/btrfs/file.c | 128 ++++++++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 112 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index 9d41b28c67ba..a020fa5b077c 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -1904,16 +1904,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
> > >
> > >  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> > >  {
> > > +     const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
> > >       struct file *file = iocb->ki_filp;
> > >       struct inode *inode = file_inode(file);
> > >       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > >       loff_t pos;
> > >       ssize_t written = 0;
> > >       ssize_t written_buffered;
> > > +     size_t prev_left = 0;
> > >       loff_t endbyte;
> > >       ssize_t err;
> > >       unsigned int ilock_flags = 0;
> > > -     struct iomap_dio *dio = NULL;
> > >
> > >       if (iocb->ki_flags & IOCB_NOWAIT)
> > >               ilock_flags |= BTRFS_ILOCK_TRY;
> > > @@ -1956,23 +1957,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> > >               goto buffered;
> > >       }
> > >
> > > -     dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > -                          0, 0);
> > > +     /*
> > > +      * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> > > +      * calls generic_write_sync() (through iomap_dio_complete()), because
> > > +      * that results in calling fsync (btrfs_sync_file()) which will try to
> > > +      * lock the inode in exclusive/write mode.
> > > +      */
> > > +     if (is_sync_write)
> > > +             iocb->ki_flags &= ~IOCB_DSYNC;
> > >
> > > -     btrfs_inode_unlock(inode, ilock_flags);
> > > +     /*
> > > +      * The iov_iter can be mapped to the same file range we are writing to.
> > > +      * If that's the case, then we will deadlock in the iomap code, because
> > > +      * it first calls our callback btrfs_dio_iomap_begin(), which will create
> > > +      * an ordered extent, and after that it will fault in the pages that the
> > > +      * iov_iter refers to. During the fault in we end up in the readahead
> > > +      * pages code (starting at btrfs_readahead()), which will lock the range,
> > > +      * find that ordered extent and then wait for it to complete (at
> > > +      * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> > > +      * obviously the ordered extent can never complete, as we haven't yet
> > > +      * submitted the respective bio(s). This always happens when the buffer is
> > > +      * memory mapped to the same file range, since the iomap DIO code always
> > > +      * invalidates pages in the target file range (after starting and waiting
> > > +      * for any writeback).
> >
> > I'm misunderstanding something about this part of the comment. Sorry for
> > the dumb question:
> 
> It's not dumb at all, it's a very good observation.
> 
> >
> > If the invalidate always triggers the issue, why does it work the second
> > time after you manually fault them in and try iomap_dio_rw again? I tried
> > the patches out and traced the calls, and it did indeed work this way
> > (EFAULT on the first call, OK on the second) but I guess I just don't get
> > why the invalidate predictably causes the problem only once. I guess the
> > way you fault it in manually must differ in some crucial way from the
> > mmap the user does?
> 
> So with generic/647, I also observed what you experienced - the first
> retry seems to always succeed.
> This is correlated with the fact that the test only does a 4K write.
> 
> Try with a 1M write for example, or larger, and you're likely to get
> into an "infinite" loop like I did.
> I say "infinite" because for the 1M case in my test vm it's not really
> infinite, but it takes over 60 seconds to complete.
> But in theory it can be infinite - that is excessive, and falling back
> to a buffered write when we don't make any progress is much faster.
> 
> So a retry often actually works because the page invalidation done by
> the iomap code fails to release pages.
> This is because:
> 
> 1) When faulting in pages, we go to btrfs' readahead and page read
> code, where we get locked pages, then lock the extent ranges, and then
> submit the bio(s);
> 
> 2) When the bio completes, the end io callback for page reads is run -
> end_bio_extent_readpage() - this runs in a separate task/workqueue;
> 
> 3) There we unlock a page and after that we unlock the extent range;
> 
> 4) As soon as the page is unlocked, the task that faulted in a page is
> woken up and resumes doing its stuff - in this case it's the dio write
> task;
> 
> 5) So the dio task calls iomap which in turn attempts to invalidate
> the pages in the range - this ends up calling btrfs' page release
> callback (btrfs_releasepage()).
>     Through this call chain we end up calling
> try_release_extent_state(), which makes btrfs_releasepage() return 0
> (can't release page) if the extent range is currently still locked -
> the task calling end_bio_extent_readpage() has not yet unlocked the
> extent range (but has already unlocked the page).
> 
> So that's why the invalidation sometimes is not able to release pages,
> and why the retry works the very first time.
> If we ever change end_bio_extent_readpage() to unlock the extent range
> before unlocking a page, then the page invalidation/release should
> always work, resulting in such an infinite loop.
> 
> So it's all about this specific timing. With large writes, covering a
> mmap'ed range with many pages, I run into those long loops that seem
> infinite - and they might be for much larger writes - certainly it's
> not desirable at all to have a 1M write take 60 seconds, for
> example.
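>
> To make that ordering easier to see, here is a rough sketch of points
> 3) to 5) in code form - heavily simplified, with illustrative names,
> and not the actual end_bio_extent_readpage() implementation:
>
>     /* endio work for a completed page read (simplified sketch) */
>     static void end_bio_extent_readpage(struct bio *bio)
>     {
>             /* ... for each page completed by this bio ... */
>             unlock_page(page);
>             /*
>              * The task faulting in this page wakes up now and may
>              * already retry the dio write - its page invalidation
>              * calls btrfs_releasepage() -> try_release_extent_state(),
>              * which returns 0 (page not released) because the extent
>              * range below is still locked at this point.
>              */
>             unlock_extent(tree, page_start, page_end);
>     }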

Aha, that makes sense. Thanks for the extra explanation!

> 
> >
> > Otherwise, this looks like a nice fix and worked as advertised on my
> > setup. (and deadlocked without the fix)
> 
> Thanks for testing it and reading it.
> 
> >
> > > +      *
> > > +      * So here we disable page faults in the iov_iter and then retry if we
> > > +      * got -EFAULT, faulting in the pages before the retry.
> > > +      */
> > > +again:
> > > +     from->nofault = true;
> > > +     err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > +                        IOMAP_DIO_PARTIAL, written);
> > > +     from->nofault = false;
> > >
> > > -     if (IS_ERR_OR_NULL(dio)) {
> > > -             err = PTR_ERR_OR_ZERO(dio);
> > > -             if (err < 0 && err != -ENOTBLK)
> > > -                     goto out;
> > > -     } else {
> > > -             written = iomap_dio_complete(dio);
> > > +     if (err > 0)
> > > +             written = err;
> > > +
> > > +     if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
> > > +             const size_t left = iov_iter_count(from);
> > > +             /*
> > > +              * We have more data left to write. Try to fault in as many of
> > > +              * the remaining pages as possible and retry. We do this without
> > > +              * releasing and re-locking the inode, to prevent races with
> > > +              * truncate.
> > > +              *
> > > +              * Also, in case the iov refers to pages in the file range of the
> > > +              * file we want to write to (due to a mmap), we could enter an
> > > +              * infinite loop if we retry after faulting the pages in, since
> > > +              * iomap will invalidate any pages in the range early on, before
> > > +              * it tries to fault in the pages of the iov. So we keep track of
> > > +              * how much of the iov was left after the previous EFAULT and
> > > +              * fall back to buffered IO in case we haven't made any progress.
> > > +              */
> > > +             if (left == prev_left) {
> > > +                     err = -ENOTBLK;
> > > +             } else {
> > > +                     fault_in_iov_iter_readable(from, left);
> > > +                     prev_left = left;
> > > +                     goto again;
> > > +             }
> > >       }
> > >
> > > -     if (written < 0 || !iov_iter_count(from)) {
> > > -             err = written;
> > > +     btrfs_inode_unlock(inode, ilock_flags);
> > > +
> > > +     /*
> > > +      * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
> > > +      * the fsync (call generic_write_sync()).
> > > +      */
> > > +     if (is_sync_write)
> > > +             iocb->ki_flags |= IOCB_DSYNC;
> > > +
> > > +     /* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
> > > +     if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
> > >               goto out;
> > > -     }
> > >
> > >  buffered:
> > >       pos = iocb->ki_pos;
> > > @@ -1997,7 +2054,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> > >       invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> > >                                endbyte >> PAGE_SHIFT);
> > >  out:
> > > -     return written ? written : err;
> > > +     return err < 0 ? err : written;
> > >  }
> > >
> > >  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > > @@ -3649,6 +3706,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
> > >  static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> > >  {
> > >       struct inode *inode = file_inode(iocb->ki_filp);
> > > +     ssize_t read = 0;
> > >       ssize_t ret;
> > >
> > >       if (fsverity_active(inode))
> > > @@ -3658,10 +3716,48 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> > >               return 0;
> > >
> > >       btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > > +again:
> > > +     /*
> > > +      * This is similar to what we do for direct IO writes (see the comment
> > > +      * at btrfs_direct_write()), but we also disable page faults at the task
> > > +      * level, in addition to the iov_iter level. This is because when
> > > +      * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> > > +      * which can still trigger page faults despite having set ->nofault
> > > +      * to true on our 'to' iov_iter.
> > > +      *
> > > +      * The difference to direct IO writes is that we deadlock when trying
> > > +      * to lock the extent range in the inode's tree during the page reads
> > > +      * triggered by the fault in (while for writes it is due to waiting for
> > > +      * our own ordered extent). This is because for direct IO reads,
> > > +      * btrfs_dio_iomap_begin() returns with the extent range locked, which
> > > +      * is only unlocked in the endio callback (end_bio_extent_readpage()).
> > > +      */
> > > +     pagefault_disable();
> > > +     to->nofault = true;
> > >       ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > -                        0, 0);
> > > +                        IOMAP_DIO_PARTIAL, read);
> > > +     to->nofault = false;
> > > +     pagefault_enable();
> > > +
> > > +     if (ret > 0)
> > > +             read = ret;
> > > +
> > > +     if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> > > +             /*
> > > +              * We have more data left to read. Try to fault in as many of
> > > +              * the remaining pages as possible and retry.
> > > +              *
> > > +              * Unlike for direct IO writes, in case the iov refers to the
> > > +              * file and range we are reading from (due to a mmap), we don't
> > > +              * need to worry about an infinite loop (see btrfs_direct_write())
> > > +              * because iomap does not invalidate pages for reads, it only
> > > +              * does that for writes.
> > > +              */
> > > +             fault_in_iov_iter_writeable(to, iov_iter_count(to));
> > > +             goto again;
> > > +     }
> > >       btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > > -     return ret;
> > > +     return ret < 0 ? ret : read;
> > >  }
> > >
> > >  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > --
> > > 2.33.0
> > >


* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-09-08 10:50 [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes fdmanana
  2021-09-09 19:21 ` Boris Burkov
@ 2021-10-22  5:59 ` Wang Yugui
  2021-10-22 10:54   ` Filipe Manana
  2021-10-25  9:42 ` [PATCH v2] " fdmanana
  2021-10-25 16:27 ` [PATCH v3] " fdmanana
  3 siblings, 1 reply; 18+ messages in thread
From: Wang Yugui @ 2021-10-22  5:59 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

Hi,

I noticed an infinite loop in fstests/generic/475 when I apply this patch
and "[PATCH v9 00/17] gfs2: Fix mmap + page fault deadlocks"

Reproduction frequency: about 50%.

Call Trace 1:
Oct 22 06:13:06 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid:2652125 ppid:     1 flags:0x00004006
Oct 22 06:13:06 T7610 kernel: Call Trace:
Oct 22 06:13:06 T7610 kernel: ? iomap_dio_rw+0xa/0x30
Oct 22 06:13:06 T7610 kernel: ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
Oct 22 06:13:06 T7610 kernel: ? new_sync_read+0x118/0x1a0
Oct 22 06:13:06 T7610 kernel: ? vfs_read+0xf1/0x190
Oct 22 06:13:06 T7610 kernel: ? ksys_read+0x59/0xd0
Oct 22 06:13:06 T7610 kernel: ? do_syscall_64+0x37/0x80
Oct 22 06:13:06 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae


Call Trace 2:
Oct 22 07:45:37 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid: 9584 ppid:     1 flags:0x00004006
Oct 22 07:45:37 T7610 kernel: Call Trace:
Oct 22 07:45:37 T7610 kernel: ? iomap_dio_complete+0x9e/0x140
Oct 22 07:45:37 T7610 kernel: ? btrfs_file_read_iter+0x124/0x1c0 [btrfs]
Oct 22 07:45:37 T7610 kernel: ? new_sync_read+0x118/0x1a0
Oct 22 07:45:37 T7610 kernel: ? vfs_read+0xf1/0x190
Oct 22 07:45:37 T7610 kernel: ? ksys_read+0x59/0xd0
Oct 22 07:45:37 T7610 kernel: ? do_syscall_64+0x37/0x80
Oct 22 07:45:37 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae


We noticed some errors in dmesg before this infinite loop.
[15590.720909] BTRFS: error (device dm-0) in __btrfs_free_extent:3069: errno=-5 IO failure
[15590.723014] BTRFS info (device dm-0): forced readonly
[15590.725115] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2150: errno=-5 IO failure

[  293.404607] BTRFS error (device dm-0): error writing primary super block to device 1
[  293.405872] BTRFS: error (device dm-0) in write_all_supers:4220: errno=-5 IO failure (1 errors while writing supers)
[  293.407112] BTRFS info (device dm-0): forced readonly
[  293.408225] Buffer I/O error on dev dm-0, logical block 3670000, async page read
[  293.408378] BTRFS: error (device dm-0) in cleanup_transaction:1945: errno=-5 IO failure
[  293.411037] BTRFS: error (device dm-0) in cleanup_transaction:1945: errno=-5 IO failure

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/10/22

> From: Filipe Manana <fdmanana@suse.com>
> 
> If we do a direct IO read or write when the buffer given by the user is
> memory mapped to the file range we are going to do IO on, we can end up
> in a deadlock. This is triggered by the new test case generic/647 from
> fstests.
> 
> For a direct IO read we get a trace like this:
> 
> [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> [  967.875992] Call Trace:
> [  967.875999]  __schedule+0x3ca/0xe10
> [  967.876015]  schedule+0x43/0xe0
> [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> [  967.876253]  ? lru_cache_add+0x104/0x220
> [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> [  967.876258]  ? sched_clock_cpu+0xd/0x110
> [  967.876263]  ? lock_release+0x155/0x4a0
> [  967.876271]  read_pages+0x86/0x270
> [  967.876274]  ? lru_cache_add+0x125/0x220
> [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> [  967.876291]  filemap_fault+0x626/0xa20
> [  967.876303]  __do_fault+0x36/0xf0
> [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> [  967.876322]  handle_mm_fault+0x9e/0x260
> [  967.876327]  __get_user_pages+0x204/0x620
> [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> [  967.876379]  ? lock_release+0x155/0x4a0
> [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> [  967.876396]  iomap_apply+0x12a/0x4a0
> [  967.876398]  ? iomap_dio_rw+0x30/0x30
> [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> [  967.876415]  ? iomap_dio_rw+0x30/0x30
> [  967.876420]  ? lock_acquired+0xf3/0x420
> [  967.876429]  iomap_dio_rw+0xa/0x30
> [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> [  967.876460]  new_sync_read+0x118/0x1a0
> [  967.876472]  vfs_read+0x128/0x1b0
> [  967.876477]  __x64_sys_pread64+0x90/0xc0
> [  967.876483]  do_syscall_64+0x3b/0xc0
> [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  967.876490] RIP: 0033:0x7fb6f2c038d6
> [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
> 
> This happens because at btrfs_dio_iomap_begin() we lock the extent range
> and return with it locked - we only unlock in the endio callback, at
> end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
> iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
> faults that result in reading the pages, through the readahead callback
> btrfs_readahead(), and through there we end up attempting to lock again the
> same extent range (or a subrange of what we locked before), resulting in
> the deadlock.
> 
> For a direct IO write, the scenario is a bit different, and it results in
> a trace like this:
> 
> [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> [ 1330.351906] Call Trace:
> [ 1330.351913]  __schedule+0x3ca/0xe10
> [ 1330.351930]  schedule+0x43/0xe0
> [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> [ 1330.352133]  ? lru_cache_add+0x104/0x220
> [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> [ 1330.352143]  ? lock_release+0x155/0x4a0
> [ 1330.352151]  read_pages+0x86/0x270
> [ 1330.352155]  ? lru_cache_add+0x125/0x220
> [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> [ 1330.352172]  filemap_fault+0x626/0xa20
> [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> [ 1330.352184]  __do_fault+0x36/0xf0
> [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> [ 1330.352203]  handle_mm_fault+0x9e/0x260
> [ 1330.352208]  __get_user_pages+0x204/0x620
> [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> [ 1330.352259]  ? lock_release+0x155/0x4a0
> [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> [ 1330.352275]  iomap_apply+0x12a/0x4a0
> [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> [ 1330.352339]  new_sync_write+0x11f/0x1b0
> [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> [ 1330.352354]  vfs_write+0x292/0x3c0
> [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> [ 1330.352365]  do_syscall_64+0x3b/0xc0
> [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 1330.352372] RIP: 0033:0x7f4b0a580986
> [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> range unlocked, but later when the page faults are triggered and we try
> to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where
> we find the ordered extent for our write, created by the iomap callback
> btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
> deadlock since we can't complete the ordered extent without reading the
> pages (the iomap code only submits the bio after the pages are faulted
> in).
> 
> Fix this by setting the nofault attribute of the given iov_iter and retrying
> the direct IO read/write if we get an -EFAULT error returned from iomap.
> For reads, also disable page faults completely; this is because when we
> read from a hole or a prealloc extent, we can still trigger page faults
> due to the call to iov_iter_zero() done by iomap - at the moment, it is
> oblivious to the value of the ->nofault attribute of an iov_iter.
> We also need to keep track of the number of bytes written or read, and
> pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
> 
> This depends on the iov_iter and iomap changes done by a recent patchset
> from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> moment of this writing. The cover letter has the following subject:
> 
>    "[PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks"
> 
> The thread can be found at:
> 
> https://lore.kernel.org/all/20210827164926.1726765-1-agruenba@redhat.com/
> 
> Fixing these issues could be done without the iov_iter and iomap changes
> introduced in that patchset; however, it would be much more complex due to
> the need to reorder some operations for writes and to pass some state
> through nested and deep call chains, which would be particularly cumbersome
> for reads - for example, making the readahead and the endio handlers for
> page reads aware that we are in a direct IO read context and which inode
> and extent range we locked before.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> As noted in the changelog, this currently depends on an unmerged patchset
> that changes the iov_iter and iomap code. Unfortunately without that
> patchset merged, the solution for this bug would be much more complex
> and hairy.
> 
>  fs/btrfs/file.c | 128 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 112 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9d41b28c67ba..a020fa5b077c 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1904,16 +1904,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
>  
>  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  {
> +	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	loff_t pos;
>  	ssize_t written = 0;
>  	ssize_t written_buffered;
> +	size_t prev_left = 0;
>  	loff_t endbyte;
>  	ssize_t err;
>  	unsigned int ilock_flags = 0;
> -	struct iomap_dio *dio = NULL;
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		ilock_flags |= BTRFS_ILOCK_TRY;
> @@ -1956,23 +1957,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  		goto buffered;
>  	}
>  
> -	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			     0, 0);
> +	/*
> +	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> +	 * calls generic_write_sync() (through iomap_dio_complete()), because
> +	 * that results in calling fsync (btrfs_sync_file()) which will try to
> +	 * lock the inode in exclusive/write mode.
> +	 */
> +	if (is_sync_write)
> +		iocb->ki_flags &= ~IOCB_DSYNC;
>  
> -	btrfs_inode_unlock(inode, ilock_flags);
> +	/*
> +	 * The iov_iter can be mapped to the same file range we are writing to.
> +	 * If that's the case, then we will deadlock in the iomap code, because
> +	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
> +	 * an ordered extent, and after that it will fault in the pages that the
> +	 * iov_iter refers to. During the fault in we end up in the readahead
> +	 * pages code (starting at btrfs_readahead()), which will lock the range,
> +	 * find that ordered extent and then wait for it to complete (at
> +	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> +	 * obviously the ordered extent can never complete, as we haven't yet
> +	 * submitted the respective bio(s). This always happens when the buffer is
> +	 * memory mapped to the same file range, since the iomap DIO code always
> +	 * invalidates pages in the target file range (after starting and waiting
> +	 * for any writeback).
> +	 *
> +	 * So here we disable page faults in the iov_iter and then retry if we
> +	 * got -EFAULT, faulting in the pages before the retry.
> +	 */
> +again:
> +	from->nofault = true;
> +	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +			   IOMAP_DIO_PARTIAL, written);
> +	from->nofault = false;
>  
> -	if (IS_ERR_OR_NULL(dio)) {
> -		err = PTR_ERR_OR_ZERO(dio);
> -		if (err < 0 && err != -ENOTBLK)
> -			goto out;
> -	} else {
> -		written = iomap_dio_complete(dio);
> +	if (err > 0)
> +		written = err;
> +
> +	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
> +		const size_t left = iov_iter_count(from);
> +		/*
> +		 * We have more data left to write. Try to fault in as many of
> +		 * the remaining pages as possible and retry. We do this without
> +		 * releasing and re-locking the inode, to prevent races with
> +		 * truncate.
> +		 *
> +		 * Also, in case the iov refers to pages in the file range of the
> +		 * file we want to write to (due to a mmap), we could enter an
> +		 * infinite loop if we retry after faulting the pages in, since
> +		 * iomap will invalidate any pages in the range early on, before
> +		 * it tries to fault in the pages of the iov. So we keep track of
> +		 * how much of the iov was left after the previous EFAULT and
> +		 * fall back to buffered IO in case we haven't made any progress.
> +		 */
> +		if (left == prev_left) {
> +			err = -ENOTBLK;
> +		} else {
> +			fault_in_iov_iter_readable(from, left);
> +			prev_left = left;
> +			goto again;
> +		}
>  	}
>  
> -	if (written < 0 || !iov_iter_count(from)) {
> -		err = written;
> +	btrfs_inode_unlock(inode, ilock_flags);
> +
> +	/*
> +	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
> +	 * the fsync (call generic_write_sync()).
> +	 */
> +	if (is_sync_write)
> +		iocb->ki_flags |= IOCB_DSYNC;
> +
> +	/* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
> +	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
>  		goto out;
> -	}
>  
>  buffered:
>  	pos = iocb->ki_pos;
> @@ -1997,7 +2054,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
>  				 endbyte >> PAGE_SHIFT);
>  out:
> -	return written ? written : err;
> +	return err < 0 ? err : written;
>  }
>  
>  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> @@ -3649,6 +3706,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
>  static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	ssize_t read = 0;
>  	ssize_t ret;
>  
>  	if (fsverity_active(inode))
> @@ -3658,10 +3716,48 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  		return 0;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> +again:
> +	/*
> +	 * This is similar to what we do for direct IO writes (see the comment
> +	 * at btrfs_direct_write()), but we also disable page faults at the task
> +	 * level, in addition to the iov_iter level. This is because when
> +	 * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> +	 * which can still trigger page faults despite having set ->nofault
> +	 * to true on our 'to' iov_iter.
> +	 *
> +	 * The difference to direct IO writes is that we deadlock when trying
> +	 * to lock the extent range in the inode's tree during the page reads
> +	 * triggered by the fault in (while for writes it is due to waiting for
> +	 * our own ordered extent). This is because for direct IO reads,
> +	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
> +	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
> +	 */
> +	pagefault_disable();
> +	to->nofault = true;
>  	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			   0, 0);
> +			   IOMAP_DIO_PARTIAL, read);
> +	to->nofault = false;
> +	pagefault_enable();
> +
> +	if (ret > 0)
> +		read = ret;
> +
> +	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> +		/*
> +		 * We have more data left to read. Try to fault in as many of
> +		 * the remaining pages as possible and retry.
> +		 *
> +		 * Unlike for direct IO writes, in case the iov refers to the
> +		 * file and range we are reading from (due to a mmap), we don't
> +		 * need to worry about an infinite loop (see btrfs_direct_write())
> +		 * because iomap does not invalidate pages for reads, it only
> +		 * does that for writes.
> +		 */
> +		fault_in_iov_iter_writeable(to, iov_iter_count(to));
> +		goto again;
> +	}
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> -	return ret;
> +	return ret < 0 ? ret : read;
>  }
>  
>  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> -- 
> 2.33.0




* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-22  5:59 ` Wang Yugui
@ 2021-10-22 10:54   ` Filipe Manana
  2021-10-22 12:12     ` Wang Yugui
  0 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2021-10-22 10:54 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Fri, Oct 22, 2021 at 6:59 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> I noticed an infinite loop in fstests/generic/475 when I apply this patch
> and "[PATCH v9 00/17] gfs2: Fix mmap + page fault deadlocks"

You mean v8? I can't find v9 anywhere.

>
> Reproduction frequency: about 50%.

With v8, on top of current misc-next, I couldn't trigger any issues
after running g/475 50+ times.

>
> Call Trace 1:
> Oct 22 06:13:06 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid:2652125 ppid:     1 flags:0x00004006
> Oct 22 06:13:06 T7610 kernel: Call Trace:
> Oct 22 06:13:06 T7610 kernel: ? iomap_dio_rw+0xa/0x30
> Oct 22 06:13:06 T7610 kernel: ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
> Oct 22 06:13:06 T7610 kernel: ? new_sync_read+0x118/0x1a0
> Oct 22 06:13:06 T7610 kernel: ? vfs_read+0xf1/0x190
> Oct 22 06:13:06 T7610 kernel: ? ksys_read+0x59/0xd0
> Oct 22 06:13:06 T7610 kernel: ? do_syscall_64+0x37/0x80
> Oct 22 06:13:06 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
>
>
> Call Trace 2:
> Oct 22 07:45:37 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid: 9584 ppid:     1 flags:0x00004006
> Oct 22 07:45:37 T7610 kernel: Call Trace:
> Oct 22 07:45:37 T7610 kernel: ? iomap_dio_complete+0x9e/0x140
> Oct 22 07:45:37 T7610 kernel: ? btrfs_file_read_iter+0x124/0x1c0 [btrfs]
> Oct 22 07:45:37 T7610 kernel: ? new_sync_read+0x118/0x1a0
> Oct 22 07:45:37 T7610 kernel: ? vfs_read+0xf1/0x190
> Oct 22 07:45:37 T7610 kernel: ? ksys_read+0x59/0xd0
> Oct 22 07:45:37 T7610 kernel: ? do_syscall_64+0x37/0x80
> Oct 22 07:45:37 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
>

Are those the complete traces?
They look too short, and inexact (note the '?' prefixes).

>
> We noticed some errors in dmesg before this infinite loop.
> [15590.720909] BTRFS: error (device dm-0) in __btrfs_free_extent:3069: errno=-5 IO failure
> [15590.723014] BTRFS info (device dm-0): forced readonly
> [15590.725115] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2150: errno=-5 IO failure

Yes, that's expected: the test intentionally triggers simulated IO
failures, which can happen anywhere, not just when running delayed
references.
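
If the loop you're seeing does turn out to be the retry in
btrfs_direct_read(), one defensive option is to apply the same progress
check the write side already does with 'prev_left'. A rough sketch of
that idea (untested, illustrative only):

	size_t prev_left = 0;
	ssize_t read = 0;
	...
again:
	...
	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
		const size_t left = iov_iter_count(to);

		/* Stop retrying if the last attempt made no progress. */
		if (left != prev_left) {
			fault_in_iov_iter_writeable(to, left);
			prev_left = left;
			goto again;
		}
	}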

Thanks.

>
> [  293.404607] BTRFS error (device dm-0): error writing primary super block to device 1
> [  293.405872] BTRFS: error (device dm-0) in write_all_supers:4220: errno=-5 IO failure (1 errors while writing supers)
> [  293.407112] BTRFS info (device dm-0): forced readonly
> [  293.408225] Buffer I/O error on dev dm-0, logical block 3670000, async page read
> [  293.408378] BTRFS: error (device dm-0) in cleanup_transaction:1945: errno=-5 IO failure
> [  293.411037] BTRFS: error (device dm-0) in cleanup_transaction:1945: errno=-5 IO failure
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/10/22
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If we do a direct IO read or write when the buffer given by the user is
> > memory mapped to the file range we are going to do IO on, we can end up
> > in a deadlock. This is triggered by the new test case generic/647 from
> > fstests.
> >
> > For a direct IO read we get a trace like this:
> >
> > [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> > [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> > [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> > [  967.875992] Call Trace:
> > [  967.875999]  __schedule+0x3ca/0xe10
> > [  967.876015]  schedule+0x43/0xe0
> > [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> > [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> > [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> > [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> > [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> > [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> > [  967.876253]  ? lru_cache_add+0x104/0x220
> > [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> > [  967.876258]  ? sched_clock_cpu+0xd/0x110
> > [  967.876263]  ? lock_release+0x155/0x4a0
> > [  967.876271]  read_pages+0x86/0x270
> > [  967.876274]  ? lru_cache_add+0x125/0x220
> > [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> > [  967.876291]  filemap_fault+0x626/0xa20
> > [  967.876303]  __do_fault+0x36/0xf0
> > [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> > [  967.876322]  handle_mm_fault+0x9e/0x260
> > [  967.876327]  __get_user_pages+0x204/0x620
> > [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> > [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> > [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> > [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> > [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> > [  967.876379]  ? lock_release+0x155/0x4a0
> > [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> > [  967.876396]  iomap_apply+0x12a/0x4a0
> > [  967.876398]  ? iomap_dio_rw+0x30/0x30
> > [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> > [  967.876415]  ? iomap_dio_rw+0x30/0x30
> > [  967.876420]  ? lock_acquired+0xf3/0x420
> > [  967.876429]  iomap_dio_rw+0xa/0x30
> > [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> > [  967.876460]  new_sync_read+0x118/0x1a0
> > [  967.876472]  vfs_read+0x128/0x1b0
> > [  967.876477]  __x64_sys_pread64+0x90/0xc0
> > [  967.876483]  do_syscall_64+0x3b/0xc0
> > [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  967.876490] RIP: 0033:0x7fb6f2c038d6
> > [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> > [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> > [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> > [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> > [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> > [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
> >
> > This happens because at btrfs_dio_iomap_begin() we lock the extent range
> > and return with it locked - we only unlock in the endio callback, at
> > end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
> > iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
> > faults that result in reading the pages, through the readahead callback
> > btrfs_readahead(), and through there we end up attempting to lock again the
> > same extent range (or a subrange of what we locked before), resulting in
> > the deadlock.
> >
> > For a direct IO write, the scenario is a bit different, and it results in
> > a trace like this:
> >
> > [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> > [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> > [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> > [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> > [ 1330.351906] Call Trace:
> > [ 1330.351913]  __schedule+0x3ca/0xe10
> > [ 1330.351930]  schedule+0x43/0xe0
> > [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> > [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> > [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> > [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> > [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> > [ 1330.352133]  ? lru_cache_add+0x104/0x220
> > [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> > [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> > [ 1330.352143]  ? lock_release+0x155/0x4a0
> > [ 1330.352151]  read_pages+0x86/0x270
> > [ 1330.352155]  ? lru_cache_add+0x125/0x220
> > [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> > [ 1330.352172]  filemap_fault+0x626/0xa20
> > [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> > [ 1330.352184]  __do_fault+0x36/0xf0
> > [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> > [ 1330.352203]  handle_mm_fault+0x9e/0x260
> > [ 1330.352208]  __get_user_pages+0x204/0x620
> > [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> > [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> > [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> > [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> > [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> > [ 1330.352259]  ? lock_release+0x155/0x4a0
> > [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> > [ 1330.352275]  iomap_apply+0x12a/0x4a0
> > [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> > [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> > [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> > [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> > [ 1330.352339]  new_sync_write+0x11f/0x1b0
> > [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> > [ 1330.352354]  vfs_write+0x292/0x3c0
> > [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> > [ 1330.352365]  do_syscall_64+0x3b/0xc0
> > [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [ 1330.352372] RIP: 0033:0x7f4b0a580986
> > [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> > [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> > [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> > [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> > [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> > [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >
> > Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> > range unlocked, but later when the page faults are triggered and we try
> > to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where
> > we find the ordered extent for our write, created by the iomap callback
> > btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
> > deadlock since we can't complete the ordered extent without reading the
> > pages (the iomap code only submits the bio after the pages are faulted
> > in).
> >
> > Fix this by setting the nofault attribute of the given iov_iter and retrying
> > the direct IO read/write if we get an -EFAULT error returned from iomap.
> > For reads, also disable page faults completely; this is because when we
> > read from a hole or a prealloc extent, we can still trigger page faults
> > due to the call to iov_iter_zero() done by iomap - at the moment, it is
> > oblivious to the value of the ->nofault attribute of an iov_iter.
> > We also need to keep track of the number of bytes written or read, and
> > pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
> >
> > This depends on the iov_iter and iomap changes done by a recent patchset
> > from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> > moment of this writing. The cover letter has the following subject:
> >
> >    "[PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks"
> >
> > The thread can be found at:
> >
> > https://lore.kernel.org/all/20210827164926.1726765-1-agruenba@redhat.com/
> >
> > Fixing these issues could be done without the iov_iter and iomap changes
> > introduced in that patchset; however, it would be much more complex due to
> > the need to reorder some operations for writes and to pass some state
> > through nested and deep call chains, which would be particularly cumbersome
> > for reads - for example, making the readahead and the endio handlers for
> > page reads aware that we are in a direct IO read context and which inode
> > and extent range we locked before.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >
> > As noted in the changelog, this currently depends on an unmerged patchset
> > that changes the iov_iter and iomap code. Unfortunately without that
> > patchset merged, the solution for this bug would be much more complex
> > and hairy.
> >
> >  fs/btrfs/file.c | 128 ++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 112 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 9d41b28c67ba..a020fa5b077c 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1904,16 +1904,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
> >
> >  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >  {
> > +     const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
> >       struct file *file = iocb->ki_filp;
> >       struct inode *inode = file_inode(file);
> >       struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >       loff_t pos;
> >       ssize_t written = 0;
> >       ssize_t written_buffered;
> > +     size_t prev_left = 0;
> >       loff_t endbyte;
> >       ssize_t err;
> >       unsigned int ilock_flags = 0;
> > -     struct iomap_dio *dio = NULL;
> >
> >       if (iocb->ki_flags & IOCB_NOWAIT)
> >               ilock_flags |= BTRFS_ILOCK_TRY;
> > @@ -1956,23 +1957,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >               goto buffered;
> >       }
> >
> > -     dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                          0, 0);
> > +     /*
> > +      * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> > +      * calls generic_write_sync() (through iomap_dio_complete()), because
> > +      * that results in calling fsync (btrfs_sync_file()) which will try to
> > +      * lock the inode in exclusive/write mode.
> > +      */
> > +     if (is_sync_write)
> > +             iocb->ki_flags &= ~IOCB_DSYNC;
> >
> > -     btrfs_inode_unlock(inode, ilock_flags);
> > +     /*
> > +      * The iov_iter can be mapped to the same file range we are writing to.
> > +      * If that's the case, then we will deadlock in the iomap code, because
> > +      * it first calls our callback btrfs_dio_iomap_begin(), which will create
> > +      * an ordered extent, and after that it will fault in the pages that the
> > +      * iov_iter refers to. During the fault in we end up in the readahead
> > +      * pages code (starting at btrfs_readahead()), which will lock the range,
> > +      * find that ordered extent and then wait for it to complete (at
> > +      * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> > +      * obviously the ordered extent can never complete, as we haven't yet
> > +      * submitted the respective bio(s). This always happens when the buffer is
> > +      * memory mapped to the same file range, since the iomap DIO code always
> > +      * invalidates pages in the target file range (after starting and waiting
> > +      * for any writeback).
> > +      *
> > +      * So here we disable page faults in the iov_iter and then retry if we
> > +      * got -EFAULT, faulting in the pages before the retry.
> > +      */
> > +again:
> > +     from->nofault = true;
> > +     err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > +                        IOMAP_DIO_PARTIAL, written);
> > +     from->nofault = false;
> >
> > -     if (IS_ERR_OR_NULL(dio)) {
> > -             err = PTR_ERR_OR_ZERO(dio);
> > -             if (err < 0 && err != -ENOTBLK)
> > -                     goto out;
> > -     } else {
> > -             written = iomap_dio_complete(dio);
> > +     if (err > 0)
> > +             written = err;
> > +
> > +     if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
> > +             const size_t left = iov_iter_count(from);
> > +             /*
> > +              * We have more data left to write. Try to fault in as many of
> > +              * the remaining pages as possible and retry. We do this without
> > +              * releasing and re-locking the inode, to prevent races with
> > +              * truncate.
> > +              *
> > +              * Also, in case the iov refers to pages in the file range of the
> > +              * file we want to write to (due to a mmap), we could enter an
> > +              * infinite loop if we retry after faulting the pages in, since
> > +              * iomap will invalidate any pages in the range early on, before
> > +              * it tries to fault in the pages of the iov. So we keep track of
> > +              * how much of the iov was left after the previous EFAULT and
> > +              * fall back to buffered IO in case we haven't made any progress.
> > +              */
> > +             if (left == prev_left) {
> > +                     err = -ENOTBLK;
> > +             } else {
> > +                     fault_in_iov_iter_readable(from, left);
> > +                     prev_left = left;
> > +                     goto again;
> > +             }
> >       }
> >
> > -     if (written < 0 || !iov_iter_count(from)) {
> > -             err = written;
> > +     btrfs_inode_unlock(inode, ilock_flags);
> > +
> > +     /*
> > +      * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
> > +      * the fsync (call generic_write_sync()).
> > +      */
> > +     if (is_sync_write)
> > +             iocb->ki_flags |= IOCB_DSYNC;
> > +
> > +     /* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
> > +     if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
> >               goto out;
> > -     }
> >
> >  buffered:
> >       pos = iocb->ki_pos;
> > @@ -1997,7 +2054,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >       invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
> >                                endbyte >> PAGE_SHIFT);
> >  out:
> > -     return written ? written : err;
> > +     return err < 0 ? err : written;
> >  }
> >
> >  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > @@ -3649,6 +3706,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
> >  static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> >  {
> >       struct inode *inode = file_inode(iocb->ki_filp);
> > +     ssize_t read = 0;
> >       ssize_t ret;
> >
> >       if (fsverity_active(inode))
> > @@ -3658,10 +3716,48 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> >               return 0;
> >
> >       btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > +again:
> > +     /*
> > +      * This is similar to what we do for direct IO writes, see the comment
> > +      * at btrfs_direct_write(), but we also disable page faults for the
> > +      * whole task, in addition to disabling them at the iov_iter level.
> > +      * This is because when reading from a hole or prealloc extent, iomap
> > +      * calls iov_iter_zero(), which can still trigger page faults despite
> > +      * having set ->nofault to true on our 'to' iov_iter.
> > +      *
> > +      * The difference from direct IO writes is that we deadlock when trying
> > +      * to lock the extent range in the inode's tree during the page reads
> > +      * triggered by the fault in (while for writes it is due to waiting for
> > +      * our own ordered extent). This is because for direct IO reads,
> > +      * btrfs_dio_iomap_begin() returns with the extent range locked, which
> > +      * is only unlocked in the endio callback (end_bio_extent_readpage()).
> > +      */
> > +     pagefault_disable();
> > +     to->nofault = true;
> >       ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                        0, 0);
> > +                        IOMAP_DIO_PARTIAL, read);
> > +     to->nofault = false;
> > +     pagefault_enable();
> > +
> > +     if (ret > 0)
> > +             read = ret;
> > +
> > +     if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> > +             /*
> > +              * We have more data left to read. Try to fault in as many of
> > +              * the remaining pages as possible and retry.
> > +              *
> > +              * Unlike for direct IO writes, in case the iov refers to the
> > +              * file and range we are reading from (due to a mmap), we don't
> > +              * need to worry about an infinite loop (see btrfs_direct_write())
> > +              * because iomap does not invalidate pages for reads, only does
> > +              * it for writes.
> > +              */
> > +             fault_in_iov_iter_writeable(to, iov_iter_count(to));
> > +             goto again;
> > +     }
> >       btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > -     return ret;
> > +     return ret < 0 ? ret : read;
> >  }
> >
> >  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > --
> > 2.33.0
>
>
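
To make the failure mode above concrete: the triggering IO pattern reduces
to issuing direct IO with a buffer that is mmap'ed to the very file range
being read or written. A minimal userspace sketch of that pattern follows
(illustrative only - the file path is an assumption and this is not the
actual generic/647 reproducer):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical test file on a btrfs mount. */
	int fd = open("/mnt/testfile", O_CREAT | O_RDWR | O_DIRECT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ftruncate(fd, 4096)) {
		perror("ftruncate");
		return 1;
	}

	/* Map the same 4K range we are about to read with O_DIRECT. */
	char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * The destination buffer is the mapping of the very range being
	 * read, so faulting 'buf' in happens while the DIO path still
	 * holds the extent range locked (reads) or has an ordered extent
	 * pending (writes) - the deadlock described in this thread.
	 */
	printf("pread returned %zd\n", pread(fd, buf, 4096, 0));

	munmap(buf, 4096);
	close(fd);
	return 0;
}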

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-22 10:54   ` Filipe Manana
@ 2021-10-22 12:12     ` Wang Yugui
  2021-10-22 13:17       ` Filipe Manana
  0 siblings, 1 reply; 18+ messages in thread
From: Wang Yugui @ 2021-10-22 12:12 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

Hi,

> On Fri, Oct 22, 2021 at 6:59 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >
> > Hi,
> >
> > I noticed an infinite loop of fstests/generic/475 when I apply this patch
> > and "[PATCH v9 00/17] gfs2: Fix mmap + page fault deadlocks"
> 
> You mean v8? I can't find v9 anywhere.

Yes. It is v8.


> >
> > reproduction frequency: about 50%.
> 
> with v8, on top of current misc-next, I couldn't trigger any issues
> after running g/475 50+ times.
> 
> >
> > Call Trace 1:
> > Oct 22 06:13:06 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid:2652125 ppid:     1 flags:0x00004006
> > Oct 22 06:13:06 T7610 kernel: Call Trace:
> > Oct 22 06:13:06 T7610 kernel: ? iomap_dio_rw+0xa/0x30
> > Oct 22 06:13:06 T7610 kernel: ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
> > Oct 22 06:13:06 T7610 kernel: ? new_sync_read+0x118/0x1a0
> > Oct 22 06:13:06 T7610 kernel: ? vfs_read+0xf1/0x190
> > Oct 22 06:13:06 T7610 kernel: ? ksys_read+0x59/0xd0
> > Oct 22 06:13:06 T7610 kernel: ? do_syscall_64+0x37/0x80
> > Oct 22 06:13:06 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> >
> > Call Trace 2:
> > Oct 22 07:45:37 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid: 9584 ppid:     1 flags:0x00004006
> > Oct 22 07:45:37 T7610 kernel: Call Trace:
> > Oct 22 07:45:37 T7610 kernel: ? iomap_dio_complete+0x9e/0x140
> > Oct 22 07:45:37 T7610 kernel: ? btrfs_file_read_iter+0x124/0x1c0 [btrfs]
> > Oct 22 07:45:37 T7610 kernel: ? new_sync_read+0x118/0x1a0
> > Oct 22 07:45:37 T7610 kernel: ? vfs_read+0xf1/0x190
> > Oct 22 07:45:37 T7610 kernel: ? ksys_read+0x59/0xd0
> > Oct 22 07:45:37 T7610 kernel: ? do_syscall_64+0x37/0x80
> > Oct 22 07:45:37 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> 
> Are those the complete traces?
> They look too short, and inexact (the '?' prefix).

Yes, these are complete traces. I do not know the reason for the '?' prefix.

I ran fstests/generic/475 2 more times:
- failed to reproduce on SSD/SAS
- succeeded in reproducing on SSD/NVMe

Then I gathered 'sysrq -t' output 3 times.

[  909.099550] task:fsstress        state:R  running task     stack:    0 pid: 9269 ppid:     1 flags:0x00004006
[  909.100594] Call Trace:
[  909.101633]  ? __clear_user+0x40/0x70
[  909.102675]  ? lock_release+0x1c6/0x270
[  909.103717]  ? alloc_extent_state+0xc1/0x190 [btrfs]
[  909.104822]  ? fixup_exception+0x41/0x60
[  909.105881]  ? rcu_read_lock_held_common+0xe/0x40
[  909.106924]  ? rcu_read_lock_sched_held+0x23/0x80
[  909.107947]  ? rcu_read_lock_sched_held+0x23/0x80
[  909.108966]  ? slab_post_alloc_hook+0x50/0x340
[  909.109993]  ? trace_hardirqs_on+0x1a/0xd0
[  909.111039]  ? lock_extent_bits+0x64/0x90 [btrfs]
[  909.112202]  ? __clear_extent_bit+0x37a/0x530 [btrfs]
[  909.113366]  ? filemap_write_and_wait_range+0x87/0xd0
[  909.114455]  ? clear_extent_bit+0x15/0x20 [btrfs]
[  909.115628]  ? __iomap_dio_rw+0x284/0x830
[  909.116741]  ? find_vma+0x32/0xb0
[  909.117868]  ? __get_user_pages+0xba/0x740
[  909.118971]  ? iomap_dio_rw+0xa/0x30
[  909.120069]  ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
[  909.121219]  ? new_sync_read+0x11b/0x1b0
[  909.122301]  ? vfs_read+0xf7/0x190
[  909.123373]  ? ksys_read+0x5f/0xe0
[  909.124451]  ? do_syscall_64+0x37/0x80
[  909.125556]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae

[ 1066.293028] task:fsstress        state:R  running task     stack:    0 pid: 9269 ppid:     1 flags:0x00004006
[ 1066.294069] Call Trace:
[ 1066.295111]  ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
[ 1066.296213]  ? new_sync_read+0x11b/0x1b0
[ 1066.297268]  ? vfs_read+0xf7/0x190
[ 1066.298314]  ? ksys_read+0x5f/0xe0
[ 1066.299359]  ? do_syscall_64+0x37/0x80
[ 1066.300394]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae


[ 1201.027178] task:fsstress        state:R  running task     stack:    0 pid: 9269 ppid:     1 flags:0x00004006
[ 1201.028233] Call Trace:
[ 1201.029298]  ? iomap_dio_rw+0xa/0x30
[ 1201.030352]  ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
[ 1201.031465]  ? new_sync_read+0x11b/0x1b0
[ 1201.032534]  ? vfs_read+0xf7/0x190
[ 1201.033592]  ? ksys_read+0x5f/0xe0
[ 1201.034633]  ? do_syscall_64+0x37/0x80
[ 1201.035661]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae

By the way, I enable '-O no-holes -R free-space-tree' for mkfs.btrfs by
default.


> >
> > We noticed some errors in dmesg before this infinite loop.
> > [15590.720909] BTRFS: error (device dm-0) in __btrfs_free_extent:3069: errno=-5 IO failure
> > [15590.723014] BTRFS info (device dm-0): forced readonly
> > [15590.725115] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2150: errno=-5 IO failure
> 
> Yes, that's expected, the test intentionally triggers simulated IO
> failures, which can happen anywhere, not just when running delayed
> references.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/10/22



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-22 12:12     ` Wang Yugui
@ 2021-10-22 13:17       ` Filipe Manana
  2021-10-23  3:58         ` Wang Yugui
  0 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2021-10-22 13:17 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5272 bytes --]

On Fri, Oct 22, 2021 at 1:12 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> > On Fri, Oct 22, 2021 at 6:59 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > >
> > > Hi,
> > >
> > > I noticed an infinite loop of fstests/generic/475 when I apply this patch
> > > and "[PATCH v9 00/17] gfs2: Fix mmap + page fault deadlocks"
> >
> > You mean v8? I can't find v9 anywhere.
>
> Yes. It is v8.
>
>
> > >
> > > reproduction frequency: about 50%.
> >
> > with v8, on top of current misc-next, I couldn't trigger any issues
> > after running g/475 50+ times.
> >
> > >
> > > Call Trace 1:
> > > Oct 22 06:13:06 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid:2652125 ppid:     1 flags:0x00004006
> > > Oct 22 06:13:06 T7610 kernel: Call Trace:
> > > Oct 22 06:13:06 T7610 kernel: ? iomap_dio_rw+0xa/0x30
> > > Oct 22 06:13:06 T7610 kernel: ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
> > > Oct 22 06:13:06 T7610 kernel: ? new_sync_read+0x118/0x1a0
> > > Oct 22 06:13:06 T7610 kernel: ? vfs_read+0xf1/0x190
> > > Oct 22 06:13:06 T7610 kernel: ? ksys_read+0x59/0xd0
> > > Oct 22 06:13:06 T7610 kernel: ? do_syscall_64+0x37/0x80
> > > Oct 22 06:13:06 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >
> > >
> > > Call Trace 2:
> > > Oct 22 07:45:37 T7610 kernel: task:fsstress        state:R  running task     stack:    0 pid: 9584 ppid:     1 flags:0x00004006
> > > Oct 22 07:45:37 T7610 kernel: Call Trace:
> > > Oct 22 07:45:37 T7610 kernel: ? iomap_dio_complete+0x9e/0x140
> > > Oct 22 07:45:37 T7610 kernel: ? btrfs_file_read_iter+0x124/0x1c0 [btrfs]
> > > Oct 22 07:45:37 T7610 kernel: ? new_sync_read+0x118/0x1a0
> > > Oct 22 07:45:37 T7610 kernel: ? vfs_read+0xf1/0x190
> > > Oct 22 07:45:37 T7610 kernel: ? ksys_read+0x59/0xd0
> > > Oct 22 07:45:37 T7610 kernel: ? do_syscall_64+0x37/0x80
> > > Oct 22 07:45:37 T7610 kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >
> >
> > Are those the complete traces?
> > They look too short, and inexact (the '?' prefix).
>
> Yes, these are complete traces. I do not know the reason for the '?' prefix.
>
> I ran fstests/generic/475 2 more times:
> - failed to reproduce on SSD/SAS
> - succeeded in reproducing on SSD/NVMe
>
> Then I gathered 'sysrq -t' output 3 times.
>
> [  909.099550] task:fsstress        state:R  running task     stack:    0 pid: 9269 ppid:     1 flags:0x00004006
> [  909.100594] Call Trace:
> [  909.101633]  ? __clear_user+0x40/0x70
> [  909.102675]  ? lock_release+0x1c6/0x270
> [  909.103717]  ? alloc_extent_state+0xc1/0x190 [btrfs]
> [  909.104822]  ? fixup_exception+0x41/0x60
> [  909.105881]  ? rcu_read_lock_held_common+0xe/0x40
> [  909.106924]  ? rcu_read_lock_sched_held+0x23/0x80
> [  909.107947]  ? rcu_read_lock_sched_held+0x23/0x80
> [  909.108966]  ? slab_post_alloc_hook+0x50/0x340
> [  909.109993]  ? trace_hardirqs_on+0x1a/0xd0
> [  909.111039]  ? lock_extent_bits+0x64/0x90 [btrfs]
> [  909.112202]  ? __clear_extent_bit+0x37a/0x530 [btrfs]
> [  909.113366]  ? filemap_write_and_wait_range+0x87/0xd0
> [  909.114455]  ? clear_extent_bit+0x15/0x20 [btrfs]
> [  909.115628]  ? __iomap_dio_rw+0x284/0x830
> [  909.116741]  ? find_vma+0x32/0xb0
> [  909.117868]  ? __get_user_pages+0xba/0x740
> [  909.118971]  ? iomap_dio_rw+0xa/0x30
> [  909.120069]  ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
> [  909.121219]  ? new_sync_read+0x11b/0x1b0
> [  909.122301]  ? vfs_read+0xf7/0x190
> [  909.123373]  ? ksys_read+0x5f/0xe0
> [  909.124451]  ? do_syscall_64+0x37/0x80
> [  909.125556]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [ 1066.293028] task:fsstress        state:R  running task     stack:    0 pid: 9269 ppid:     1 flags:0x00004006
> [ 1066.294069] Call Trace:
> [ 1066.295111]  ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
> [ 1066.296213]  ? new_sync_read+0x11b/0x1b0
> [ 1066.297268]  ? vfs_read+0xf7/0x190
> [ 1066.298314]  ? ksys_read+0x5f/0xe0
> [ 1066.299359]  ? do_syscall_64+0x37/0x80
> [ 1066.300394]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
>
>
> [ 1201.027178] task:fsstress        state:R  running task     stack:    0 pid: 9269 ppid:     1 flags:0x00004006
> [ 1201.028233] Call Trace:
> [ 1201.029298]  ? iomap_dio_rw+0xa/0x30
> [ 1201.030352]  ? btrfs_file_read_iter+0x157/0x1c0 [btrfs]
> [ 1201.031465]  ? new_sync_read+0x11b/0x1b0
> [ 1201.032534]  ? vfs_read+0xf7/0x190
> [ 1201.033592]  ? ksys_read+0x5f/0xe0
> [ 1201.034633]  ? do_syscall_64+0x37/0x80
> [ 1201.035661]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> By the way, I enable '-O no-holes -R free-space-tree' for mkfs.btrfs by
> default.

Those mkfs options/fs features should be irrelevant.

Can you try with the attached patch applied on top of those patches?

Thanks.

>
>
> > >
> > > We noticed some  error in dmesg before this infinite loop.
> > > [15590.720909] BTRFS: error (device dm-0) in __btrfs_free_extent:3069: errno=-5 IO failure
> > > [15590.723014] BTRFS info (device dm-0): forced readonly
> > > [15590.725115] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2150: errno=-5 IO failure
> >
> > Yes, that's expected, the test intentionally triggers simulated IO
> > failures, which can happen anywhere, not just when running delayed
> > references.
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/10/22
>
>

[-- Attachment #2: foobar.patch --]
[-- Type: text/x-patch, Size: 1559 bytes --]

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 91a5fa814d75..19feb4cabb55 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3716,6 +3716,7 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
 static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t prev_left = 0;
 	ssize_t read = 0;
 	ssize_t ret;
 
@@ -3753,18 +3754,23 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		read = ret;
 
 	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+		const size_t left = iov_iter_count(to);
+
 		/*
 		 * We have more data left to read. Try to fault in as many of
 		 * the remaining pages as possible and retry.
-		 *
-		 * Unlike for direct IO writes, in case the iov refers to the
-		 * file and range we are reading from (due to a mmap), we don't
-		 * need to worry about an infinite loop (see btrfs_direct_write())
-		 * because iomap does not invalidate pages for reads, only does
-		 * it for writes.
 		 */
-		fault_in_iov_iter_writeable(to, iov_iter_count(to));
-		goto again;
+		if (prev_left == 0 || left < prev_left) {
+			fault_in_iov_iter_writeable(to, left);
+			prev_left = left;
+			goto again;
+		}
+		/*
+		 * If we didn't make any progress since the last attempt, fall
+		 * back to a buffered read for the remainder of the range.
+		 * This is just to avoid any possibility of looping for too long.
+		 */
+		ret = read;
 	}
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret < 0 ? ret : read;
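
The guard above follows a simple rule: retry a page-fault-disabled attempt
only while each retry shrinks the amount left, and give up (falling back to
buffered IO, in the kernel code) on the first retry that makes no progress.
A standalone sketch of that pattern, where attempt() is a stand-in for a
DIO attempt that keeps hitting an unfaultable page (names and behavior are
illustrative, not the kernel API):

#include <stddef.h>
#include <stdio.h>

/* Pretend the final 4096 bytes can never be faulted in, so every
 * attempt on them fails with -1 (think -EFAULT). */
static long attempt(size_t *left)
{
	if (*left <= 4096)
		return -1;
	*left -= 4096;
	return 4096;
}

int main(void)
{
	size_t left = 16384;
	size_t prev_left = 0;
	long total = 0;

	for (;;) {
		long ret = attempt(&left);

		if (ret > 0)
			total += ret;
		if (left == 0)
			break;
		if (ret < 0) {
			/* A retry that did not shrink 'left' made no
			 * forward progress; stop retrying. */
			if (left == prev_left)
				break;
			/* fault_in_iov_iter_writeable() would run here. */
			prev_left = left;
		}
	}
	/* Prints: transferred 12288 bytes, 4096 left over */
	printf("transferred %ld bytes, %zu left over\n", total, left);
	return 0;
}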

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-22 13:17       ` Filipe Manana
@ 2021-10-23  3:58         ` Wang Yugui
  2021-10-25  9:41           ` Filipe Manana
  0 siblings, 1 reply; 18+ messages in thread
From: Wang Yugui @ 2021-10-23  3:58 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

hi,

With this new patch, xfstest/generic/475 and xfstest/generic/650 work well.

Thanks a lot.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/10/23

> On Fri, Oct 22, 2021 at 1:12 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
> [...]
>
> Those mkfs options/fs features should be irrelevant.
>
> Can you try with the attached patch applied on top of those patches?
>
> Thanks.
> [...]



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-23  3:58         ` Wang Yugui
@ 2021-10-25  9:41           ` Filipe Manana
  0 siblings, 0 replies; 18+ messages in thread
From: Filipe Manana @ 2021-10-25  9:41 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs

On Sat, Oct 23, 2021 at 4:58 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> hi,
>
> With this new patch, xfstest/generic/475 and xfstest/generic/650 work well.
>
> Thanks a lot.

Thanks for testing and reporting.
I'll integrate the patch into a v2.
Feel free to comment with a Tested-by tag on it if you want to.

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/10/23
>
> > [...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-09-08 10:50 [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes fdmanana
  2021-09-09 19:21 ` Boris Burkov
  2021-10-22  5:59 ` Wang Yugui
@ 2021-10-25  9:42 ` fdmanana
  2021-10-25 14:42   ` Josef Bacik
  2021-10-25 16:27 ` [PATCH v3] " fdmanana
  3 siblings, 1 reply; 18+ messages in thread
From: fdmanana @ 2021-10-25  9:42 UTC (permalink / raw)
  To: linux-btrfs; +Cc: wangyugui, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO on, we end up in
a deadlock. This is triggered by the new test case generic/647 from
fstests.

For a direct IO read we get a trace like this:

[  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
[  967.875992] Call Trace:
[  967.875999]  __schedule+0x3ca/0xe10
[  967.876015]  schedule+0x43/0xe0
[  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
[  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
[  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
[  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
[  967.876253]  ? lru_cache_add+0x104/0x220
[  967.876255]  ? kvm_sched_clock_read+0x14/0x40
[  967.876258]  ? sched_clock_cpu+0xd/0x110
[  967.876263]  ? lock_release+0x155/0x4a0
[  967.876271]  read_pages+0x86/0x270
[  967.876274]  ? lru_cache_add+0x125/0x220
[  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
[  967.876291]  filemap_fault+0x626/0xa20
[  967.876303]  __do_fault+0x36/0xf0
[  967.876308]  __handle_mm_fault+0x83f/0x15f0
[  967.876322]  handle_mm_fault+0x9e/0x260
[  967.876327]  __get_user_pages+0x204/0x620
[  967.876332]  ? get_user_pages_unlocked+0x69/0x340
[  967.876340]  get_user_pages_unlocked+0xd3/0x340
[  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
[  967.876366]  iov_iter_get_pages+0x8d/0x3a0
[  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
[  967.876379]  ? lock_release+0x155/0x4a0
[  967.876387]  iomap_dio_bio_actor+0x232/0x410
[  967.876396]  iomap_apply+0x12a/0x4a0
[  967.876398]  ? iomap_dio_rw+0x30/0x30
[  967.876414]  __iomap_dio_rw+0x29f/0x5e0
[  967.876415]  ? iomap_dio_rw+0x30/0x30
[  967.876420]  ? lock_acquired+0xf3/0x420
[  967.876429]  iomap_dio_rw+0xa/0x30
[  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
[  967.876460]  new_sync_read+0x118/0x1a0
[  967.876472]  vfs_read+0x128/0x1b0
[  967.876477]  __x64_sys_pread64+0x90/0xc0
[  967.876483]  do_syscall_64+0x3b/0xc0
[  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  967.876490] RIP: 0033:0x7fb6f2c038d6
[  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000

This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
faults that result in reading the pages, through the readahead callback
btrfs_readahead(), and through there we end up attempting to lock again
the same extent range (or a subrange of what we locked before), resulting
in the deadlock.

For a direct IO write, the scenario is a bit different, and it results in
a trace like this:

[ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
[ 1330.351906] Call Trace:
[ 1330.351913]  __schedule+0x3ca/0xe10
[ 1330.351930]  schedule+0x43/0xe0
[ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
[ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
[ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
[ 1330.352133]  ? lru_cache_add+0x104/0x220
[ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
[ 1330.352138]  ? sched_clock_cpu+0xd/0x110
[ 1330.352143]  ? lock_release+0x155/0x4a0
[ 1330.352151]  read_pages+0x86/0x270
[ 1330.352155]  ? lru_cache_add+0x125/0x220
[ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
[ 1330.352172]  filemap_fault+0x626/0xa20
[ 1330.352176]  ? filemap_map_pages+0x18b/0x660
[ 1330.352184]  __do_fault+0x36/0xf0
[ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
[ 1330.352203]  handle_mm_fault+0x9e/0x260
[ 1330.352208]  __get_user_pages+0x204/0x620
[ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
[ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
[ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
[ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
[ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
[ 1330.352259]  ? lock_release+0x155/0x4a0
[ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
[ 1330.352275]  iomap_apply+0x12a/0x4a0
[ 1330.352278]  ? iomap_dio_rw+0x30/0x30
[ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
[ 1330.352294]  ? iomap_dio_rw+0x30/0x30
[ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
[ 1330.352339]  new_sync_write+0x11f/0x1b0
[ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[ 1330.352354]  vfs_write+0x292/0x3c0
[ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
[ 1330.352365]  do_syscall_64+0x3b/0xc0
[ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1330.352372] RIP: 0033:0x7f4b0a580986
[ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).

Fix this by setting the nofault attribute of the given iov_iter and
retrying the direct IO read/write if we get an -EFAULT error back from
iomap. For reads, also disable page faults completely; this is because
when we read from a hole or a prealloc extent, we can still trigger page
faults due to the call to iov_iter_zero() done by iomap - at the moment,
it is oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.

This depends on the iov_iter and iomap changes done by a recent patchset
from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
moment of this writing. The cover letter has the following subject:

   "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"

The thread can be found at:

https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/

Fixing these issues could be done without the iov_iter and iomap changes
introduced in that patchset, however it would be much more complex due to
the need to reorder some operations for writes and to be able to pass
some state through nested and deep call chains, which would be
particularly cumbersome for reads - for example, making the readahead and
the endio handlers for page reads aware that we are in a direct IO read
context and know which inode and extent range we locked before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Updated the read path to fall back to buffered IO in case it's taking a
    long time to make any progress, fixing an issue reported by Wang Yugui.
    Rebased on the v8 patchset it depends on and on current misc-next.

 fs/btrfs/file.c | 137 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 121 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 581662d16b72..58c94205d325 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1912,16 +1912,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
+	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	loff_t pos;
 	ssize_t written = 0;
 	ssize_t written_buffered;
+	size_t prev_left = 0;
 	loff_t endbyte;
 	ssize_t err;
 	unsigned int ilock_flags = 0;
-	struct iomap_dio *dio = NULL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -1964,23 +1965,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
-	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0, 0);
+	/*
+	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
+	 * calls generic_write_sync() (through iomap_dio_complete()), because
+	 * that results in calling fsync (btrfs_sync_file()) which will try to
+	 * lock the inode in exclusive/write mode.
+	 */
+	if (is_sync_write)
+		iocb->ki_flags &= ~IOCB_DSYNC;
 
-	btrfs_inode_unlock(inode, ilock_flags);
+	/*
+	 * The iov_iter can be mapped to the same file range we are writing to.
+	 * If that's the case, then we will deadlock in the iomap code, because
+	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
+	 * an ordered extent, and after that it will fault in the pages that the
+	 * iov_iter refers to. During the fault in we end up in the readahead
+	 * pages code (starting at btrfs_readahead()), which will lock the range,
+	 * find that ordered extent and then wait for it to complete (at
+	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
+	 * obviously the ordered extent can never complete, as we didn't yet
+	 * submit the respective bio(s). This always happens when the buffer is
+	 * memory mapped to the same file range, since the iomap DIO code always
+	 * invalidates pages in the target file range (after starting and waiting
+	 * for any writeback).
+	 *
+	 * So here we disable page faults in the iov_iter and then retry if we
+	 * got -EFAULT, faulting in the pages before the retry.
+	 */
+again:
+	from->nofault = true;
+	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   IOMAP_DIO_PARTIAL, written);
+	from->nofault = false;
 
-	if (IS_ERR_OR_NULL(dio)) {
-		err = PTR_ERR_OR_ZERO(dio);
-		if (err < 0 && err != -ENOTBLK)
-			goto out;
-	} else {
-		written = iomap_dio_complete(dio);
+	if (err > 0)
+		written = err;
+
+	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
+		const size_t left = iov_iter_count(from);
+		/*
+		 * We have more data left to write. Try to fault in as many of
+		 * the remaining pages as possible and retry. We do this without
+		 * releasing and re-locking the inode, to prevent races with
+		 * truncate.
+		 *
+		 * Also, in case the iov refers to pages in the file range we
+		 * want to write to (due to a mmap), we could enter an infinite
+		 * loop if we retry after faulting the pages in, since iomap
+		 * will invalidate any pages in the range early on, before it
+		 * tries to fault in the pages of the iov. So we keep track of
+		 * how much was left of the iov in the previous EFAULT and fall
+		 * back to buffered IO in case we haven't made any progress.
+		 */
+		if (left == prev_left) {
+			err = -ENOTBLK;
+		} else {
+			fault_in_iov_iter_readable(from, left);
+			prev_left = left;
+			goto again;
+		}
 	}
 
-	if (written < 0 || !iov_iter_count(from)) {
-		err = written;
+	btrfs_inode_unlock(inode, ilock_flags);
+
+	/*
+	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
+	 * the fsync (call generic_write_sync()).
+	 */
+	if (is_sync_write)
+		iocb->ki_flags |= IOCB_DSYNC;
+
+	/* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
+	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
-	}
 
 buffered:
 	pos = iocb->ki_pos;
@@ -2005,7 +2062,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
 				 endbyte >> PAGE_SHIFT);
 out:
-	return written ? written : err;
+	return err < 0 ? err : written;
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -3659,6 +3716,8 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
 static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t prev_left = 0;
+	ssize_t read = 0;
 	ssize_t ret;
 
 	if (fsverity_active(inode))
@@ -3668,10 +3727,56 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
+again:
+	/*
+	 * This is similar to what we do for direct IO writes, see the comment
+	 * at btrfs_direct_write(), but we also disable page faults for the
+	 * whole task, in addition to disabling them at the iov_iter level.
+	 * This is because when reading from a hole or prealloc extent, iomap
+	 * calls iov_iter_zero(), which can still trigger page faults despite
+	 * having set ->nofault to true on our 'to' iov_iter.
+	 *
+	 * The difference from direct IO writes is that we deadlock when trying
+	 * to lock the extent range in the inode's tree during the page reads
+	 * triggered by the fault in (while for writes it is due to waiting for
+	 * our own ordered extent). This is because for direct IO reads,
+	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
+	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
+	 */
+	pagefault_disable();
+	to->nofault = true;
 	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			   0, 0);
+			   IOMAP_DIO_PARTIAL, read);
+	to->nofault = false;
+	pagefault_enable();
+
+	if (ret > 0)
+		read = ret;
+
+	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+		const size_t left = iov_iter_count(to);
+
+		if (left == prev_left) {
+			/*
+			 * We didn't make any progress since the last attempt,
+			 * fall back to a buffered read for the remainder of the
+			 * range. This is just to avoid any possibility of looping
+			 * for too long.
+			 */
+			ret = read;
+		} else {
+			/*
+			 * We made some progress since the last retry or this is
+			 * the first time we are retrying. Fault in as many pages
+			 * as possible and retry.
+			 */
+			fault_in_iov_iter_writeable(to, left);
+			prev_left = left;
+			goto again;
+		}
+	}
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
-	return ret;
+	return ret < 0 ? ret : read;
 }
 
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-- 
2.33.0
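
A detail of the write path above that is easy to miss is the IOCB_DSYNC
handling: the flag is cleared before calling into iomap so that
iomap_dio_complete() cannot call generic_write_sync() - and therefore
fsync - while the inode lock is still held, and it is restored afterwards
so that the caller does the sync once the lock is released. A condensed
standalone sketch of that pattern (the types, the flag value and the
helpers are stand-ins, not kernel code):

#include <stdbool.h>
#include <stdio.h>

#define IOCB_DSYNC 0x1			/* illustrative flag value */

struct iocb_like {
	unsigned int ki_flags;
};

/* Stand-in for the DIO core: would fsync if it saw IOCB_DSYNC set. */
static long dio_core(struct iocb_like *iocb)
{
	if (iocb->ki_flags & IOCB_DSYNC)
		printf("would fsync while holding the inode lock!\n");
	return 4096;
}

static long direct_write(struct iocb_like *iocb)
{
	const bool is_sync = (iocb->ki_flags & IOCB_DSYNC);
	long ret;

	/* inode lock taken here */
	if (is_sync)
		iocb->ki_flags &= ~IOCB_DSYNC;	/* no fsync under the lock */
	ret = dio_core(iocb);
	/* inode lock released here */
	if (is_sync)
		iocb->ki_flags |= IOCB_DSYNC;	/* caller now does the sync */
	return ret;
}

int main(void)
{
	struct iocb_like iocb = { .ki_flags = IOCB_DSYNC };
	long ret = direct_write(&iocb);

	if (iocb.ki_flags & IOCB_DSYNC)
		printf("caller syncs %ld bytes after unlock\n", ret);
	return 0;
}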


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-25  9:42 ` [PATCH v2] " fdmanana
@ 2021-10-25 14:42   ` Josef Bacik
  2021-10-25 14:54     ` Filipe Manana
  0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2021-10-25 14:42 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, wangyugui, Filipe Manana

On Mon, Oct 25, 2021 at 10:42:59AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we do a direct IO read or write when the buffer given by the user is
> memory mapped to the file range we are going to do IO on, we end up in
> a deadlock. This is triggered by the new test case generic/647 from
> fstests.
> 
> For a direct IO read we get a trace like this:
> 
> [ read trace snipped - identical to the one in the patch above ]
> 
> This happens because at btrfs_dio_iomap_begin() we lock the extent range
> and return with it locked - we only unlock in the endio callback, at
> end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
> iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
> faults that result in reading the pages, through the readahead callback
> btrfs_readahead(), and through there we end up attempting to lock again
> the same extent range (or a subrange of what we locked before), resulting
> in the deadlock.
> 
> For a direct IO write, the scenario is a bit different, and it results in
> a trace like this:
> 
> [ write trace snipped - identical to the one in the patch above ]
> 
> Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> range unlocked, but later when the page faults are triggered and we try
> to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where
> we find the ordered extent for our write, created by the iomap callback
> btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
> deadlock since we can't complete the ordered extent without reading the
> pages (the iomap code only submits the bio after the pages are faulted
> in).
> 
> Fix this by setting the nofault attribute of the given iov_iter and
> retrying the direct IO read/write if we get an -EFAULT error back from
> iomap. For reads, also disable page faults completely; this is because
> when we read from a hole or a prealloc extent, we can still trigger page
> faults due to the call to iov_iter_zero() done by iomap - at the moment,
> it is oblivious to the value of the ->nofault attribute of an iov_iter.
> We also need to keep track of the number of bytes written or read, and
> pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
> 
> This depends on the iov_iter and iomap changes done by a recent patchset
> from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> moment of this writing. The cover letter has the following subject:
> 
>    "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"
> 
> The thread can be found at:
> 
> https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/
> 
> Fixing these issues could be done without the iov_iter and iomap changes
> introduced in that patchset, however it would be much more complex due to
> the need to reorder some operations for writes and to be able to pass
> some state through nested and deep call chains, which would be
> particularly cumbersome for reads - for example, making the readahead and
> the endio handlers for page reads aware that we are in a direct IO read
> context and know which inode and extent range we locked before.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Updated the read path to fall back to buffered IO in case it's taking a
>     long time to make any progress, fixing an issue reported by Wang Yugui.
>     Rebased on the v8 patchset it depends on and on current misc-next.
> 
>  fs/btrfs/file.c | 137 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 121 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 581662d16b72..58c94205d325 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1912,16 +1912,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
>  
>  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  {
> +	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	loff_t pos;
>  	ssize_t written = 0;
>  	ssize_t written_buffered;
> +	size_t prev_left = 0;
>  	loff_t endbyte;
>  	ssize_t err;
>  	unsigned int ilock_flags = 0;
> -	struct iomap_dio *dio = NULL;
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		ilock_flags |= BTRFS_ILOCK_TRY;
> @@ -1964,23 +1965,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  		goto buffered;
>  	}
>  
> -	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			     0, 0);

Since we're the only users of this (and iomap_dio_complete) you could have some
follow-up patches that get rid of these helpers and the export of
iomap_dio_complete().

> +	/*
> +	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> +	 * calls generic_write_sync() (through iomap_dio_complete()), because
> +	 * that results in calling fsync (btrfs_sync_file()) which will try to
> +	 * lock the inode in exclusive/write mode.
> +	 */
> +	if (is_sync_write)
> +		iocb->ki_flags &= ~IOCB_DSYNC;
>  
> -	btrfs_inode_unlock(inode, ilock_flags);
> +	/*
> +	 * The iov_iter can be mapped to the same file range we are writing to.
> +	 * If that's the case, then we will deadlock in the iomap code, because
> +	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
> +	 * an ordered extent, and after that it will fault in the pages that the
> +	 * iov_iter refers to. During the fault in we end up in the readahead
> +	 * pages code (starting at btrfs_readahead()), which will lock the range,
> +	 * find that ordered extent and then wait for it to complete (at
> +	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> +	 * obviously the ordered extent can never complete as we didn't yet
> +	 * submit the respective bio(s). This always happens when the buffer is
> +	 * memory mapped to the same file range, since the iomap DIO code always
> +	 * invalidates pages in the target file range (after starting and waiting
> +	 * for any writeback).
> +	 *
> +	 * So here we disable page faults in the iov_iter and then retry if we
> +	 * got -EFAULT, faulting in the pages before the retry.
> +	 */
> +again:
> +	from->nofault = true;
> +	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +			   IOMAP_DIO_PARTIAL, written);
> +	from->nofault = false;
>  
> -	if (IS_ERR_OR_NULL(dio)) {
> -		err = PTR_ERR_OR_ZERO(dio);
> -		if (err < 0 && err != -ENOTBLK)
> -			goto out;
> -	} else {
> -		written = iomap_dio_complete(dio);
> +	if (err > 0)
> +		written = err;

Should this be

	written += err;

if the first one is a partial write and then the next one is a full write,
we'll lose the first part of the partial write.

<snip>

> @@ -3668,10 +3727,56 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  		return 0;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> +again:
> +	/*
> +	 * This is similar to what we do for direct IO writes, see the comment
> +	 * at btrfs_direct_write(), but here we also disable page faults at the
> +	 * task level, in addition to disabling them at the iov_iter level. This
> +	 * is because when reading from a hole or prealloc extent, iomap calls
> +	 * iov_iter_zero(), which can still fault in pages despite ->nofault
> +	 * being set to true on our 'to' iov_iter.
> +	 *
> +	 * The difference to direct IO writes is that we deadlock when trying
> +	 * to lock the extent range in the inode's tree during the page reads
> +	 * triggered by the fault in (while for writes it is due to waiting for
> +	 * our own ordered extent). This is because for direct IO reads,
> +	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
> +	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
> +	 */
> +	pagefault_disable();
> +	to->nofault = true;
>  	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			   0, 0);
> +			   IOMAP_DIO_PARTIAL, read);
> +	to->nofault = false;
> +	pagefault_enable();
> +
> +	if (ret > 0)
> +		read = ret;
> +

Same here, this should be

	read += ret;

right?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-25 14:42   ` Josef Bacik
@ 2021-10-25 14:54     ` Filipe Manana
  2021-10-25 16:11       ` Josef Bacik
  0 siblings, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2021-10-25 14:54 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Wang Yugui, Filipe Manana

On Mon, Oct 25, 2021 at 3:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, Oct 25, 2021 at 10:42:59AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > <snip>
> > @@ -1964,23 +1965,79 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >               goto buffered;
> >       }
> >
> > -     dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                          0, 0);
>
> Since we're the only users of this (and of iomap_dio_complete()), you could
> have some follow-up patches that get rid of these helpers and the export of
> iomap_dio_complete().

Yes, I had noticed that.
But since the patchset was not yet merged, I decided to leave that for later.

>
> > <snip>
> > +again:
> > +     from->nofault = true;
> > +     err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > +                        IOMAP_DIO_PARTIAL, written);
> > +     from->nofault = false;
> >
> > -     if (IS_ERR_OR_NULL(dio)) {
> > -             err = PTR_ERR_OR_ZERO(dio);
> > -             if (err < 0 && err != -ENOTBLK)
> > -                     goto out;
> > -     } else {
> > -             written = iomap_dio_complete(dio);
> > +     if (err > 0)
> > +             written = err;
>
> Should this be
>
>         written += err;
>
> if the first one is a partial write and then the next one is a full write,
> we'll lose the first part of the partial write.

Nope, the iomap code does the +=, see:

https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-15-agruenba@redhat.com/
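
To make that concrete, with the done_before argument applied, the tail of
iomap_dio_complete() behaves roughly like this (a paraphrased sketch of
that patch, not the exact upstream code):

ssize_t iomap_dio_complete(struct iomap_dio *dio)
{
	ssize_t ret;

	/* ... compute ret from dio->error / dio->size as before ... */
	ret = dio->error ? dio->error : dio->size;

	/*
	 * 'done_before' is the byte count the caller passed as the last
	 * argument to iomap_dio_rw() for bytes transferred in previous
	 * attempts, so on success the value returned to the caller is
	 * cumulative - hence 'written = err' instead of 'written += err'
	 * on the btrfs side.
	 */
	if (ret > 0)
		ret += dio->done_before;

	return ret;
}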

>
> <snip>
>
> > @@ -3668,10 +3727,56 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> >               return 0;
> >
> >       btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > +again:
> > <snip>
> > +     pagefault_disable();
> > +     to->nofault = true;
> >       ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                        0, 0);
> > +                        IOMAP_DIO_PARTIAL, read);
> > +     to->nofault = false;
> > +     pagefault_enable();
> > +
> > +     if (ret > 0)
> > +             read = ret;
> > +
>
> Same here, this should be
>
>         read += ret;
>
> right?  Thanks,

Nope, same as above.

Thanks.

>
> Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-25 14:54     ` Filipe Manana
@ 2021-10-25 16:11       ` Josef Bacik
  0 siblings, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2021-10-25 16:11 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, Wang Yugui, Filipe Manana

On Mon, Oct 25, 2021 at 03:54:59PM +0100, Filipe Manana wrote:
> On Mon, Oct 25, 2021 at 3:42 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Mon, Oct 25, 2021 at 10:42:59AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > <snip>
> > > -     dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > -                          0, 0);
> >
> > Since we're the only users of this (and of iomap_dio_complete()), you could
> > have some follow-up patches that get rid of these helpers and the export of
> > iomap_dio_complete().
> 
> Yes, I had noticed that.
> But since the patchset was not yet merged, I decided to leave that for later.
> 

Yeah, that's fair; just want to make sure we follow up afterwards.

> >
> > > <snip>
> > > +again:
> > > +     from->nofault = true;
> > > +     err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > +                        IOMAP_DIO_PARTIAL, written);
> > > +     from->nofault = false;
> > >
> > > -     if (IS_ERR_OR_NULL(dio)) {
> > > -             err = PTR_ERR_OR_ZERO(dio);
> > > -             if (err < 0 && err != -ENOTBLK)
> > > -                     goto out;
> > > -     } else {
> > > -             written = iomap_dio_complete(dio);
> > > +     if (err > 0)
> > > +             written = err;
> >
> > Should this be
> >
> >         written += err;
> >
> > if the first one is a partial write and then the next one is a full write,
> > we'll lose the first part of the partial write.
> 
> Nope, the iomap code does the +=, see:
> 
> https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-15-agruenba@redhat.com/
> 

Hmm, can you add a comment then? Because I will for sure look at this in the
future and be confused as fuck.  And it appears I'm not the only one, as Darrick
asked for similar comments to be added to the generic code so he could remember
how it works.  Just a simple

/*
 * iomap_dio_rw() takes 'written' as an argument, so on success it returns
 * the cumulative number of bytes written.
 */

so my brain doesn't jump to conclusions.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-09-08 10:50 [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes fdmanana
                   ` (2 preceding siblings ...)
  2021-10-25  9:42 ` [PATCH v2] " fdmanana
@ 2021-10-25 16:27 ` fdmanana
  2021-10-25 18:58   ` Josef Bacik
  2021-11-09 11:27   ` Filipe Manana
  3 siblings, 2 replies; 18+ messages in thread
From: fdmanana @ 2021-10-25 16:27 UTC (permalink / raw)
  To: linux-btrfs; +Cc: wangyugui, josef, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO on, we end up in
a deadlock. This is triggered by the new test case generic/647 from
fstests.

For a direct IO read we get a trace like this:

[  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
[  967.875992] Call Trace:
[  967.875999]  __schedule+0x3ca/0xe10
[  967.876015]  schedule+0x43/0xe0
[  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
[  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
[  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
[  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
[  967.876253]  ? lru_cache_add+0x104/0x220
[  967.876255]  ? kvm_sched_clock_read+0x14/0x40
[  967.876258]  ? sched_clock_cpu+0xd/0x110
[  967.876263]  ? lock_release+0x155/0x4a0
[  967.876271]  read_pages+0x86/0x270
[  967.876274]  ? lru_cache_add+0x125/0x220
[  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
[  967.876291]  filemap_fault+0x626/0xa20
[  967.876303]  __do_fault+0x36/0xf0
[  967.876308]  __handle_mm_fault+0x83f/0x15f0
[  967.876322]  handle_mm_fault+0x9e/0x260
[  967.876327]  __get_user_pages+0x204/0x620
[  967.876332]  ? get_user_pages_unlocked+0x69/0x340
[  967.876340]  get_user_pages_unlocked+0xd3/0x340
[  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
[  967.876366]  iov_iter_get_pages+0x8d/0x3a0
[  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
[  967.876379]  ? lock_release+0x155/0x4a0
[  967.876387]  iomap_dio_bio_actor+0x232/0x410
[  967.876396]  iomap_apply+0x12a/0x4a0
[  967.876398]  ? iomap_dio_rw+0x30/0x30
[  967.876414]  __iomap_dio_rw+0x29f/0x5e0
[  967.876415]  ? iomap_dio_rw+0x30/0x30
[  967.876420]  ? lock_acquired+0xf3/0x420
[  967.876429]  iomap_dio_rw+0xa/0x30
[  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
[  967.876460]  new_sync_read+0x118/0x1a0
[  967.876472]  vfs_read+0x128/0x1b0
[  967.876477]  __x64_sys_pread64+0x90/0xc0
[  967.876483]  do_syscall_64+0x3b/0xc0
[  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  967.876490] RIP: 0033:0x7fb6f2c038d6
[  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000

This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
iomap has called the btrfs_dio_iomap_begin() callback, it triggers the
page faults that result in reading the pages, through the readahead
callback btrfs_readahead(), and through there we end up attempting to
lock the same extent range again (or a subrange of what we locked
before), resulting in the deadlock.

For a direct IO write, the scenario is a bit different, and it results in
a trace like this:

[ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
[ 1330.351906] Call Trace:
[ 1330.351913]  __schedule+0x3ca/0xe10
[ 1330.351930]  schedule+0x43/0xe0
[ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
[ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
[ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
[ 1330.352133]  ? lru_cache_add+0x104/0x220
[ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
[ 1330.352138]  ? sched_clock_cpu+0xd/0x110
[ 1330.352143]  ? lock_release+0x155/0x4a0
[ 1330.352151]  read_pages+0x86/0x270
[ 1330.352155]  ? lru_cache_add+0x125/0x220
[ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
[ 1330.352172]  filemap_fault+0x626/0xa20
[ 1330.352176]  ? filemap_map_pages+0x18b/0x660
[ 1330.352184]  __do_fault+0x36/0xf0
[ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
[ 1330.352203]  handle_mm_fault+0x9e/0x260
[ 1330.352208]  __get_user_pages+0x204/0x620
[ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
[ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
[ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
[ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
[ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
[ 1330.352259]  ? lock_release+0x155/0x4a0
[ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
[ 1330.352275]  iomap_apply+0x12a/0x4a0
[ 1330.352278]  ? iomap_dio_rw+0x30/0x30
[ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
[ 1330.352294]  ? iomap_dio_rw+0x30/0x30
[ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
[ 1330.352339]  new_sync_write+0x11f/0x1b0
[ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[ 1330.352354]  vfs_write+0x292/0x3c0
[ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
[ 1330.352365]  do_syscall_64+0x3b/0xc0
[ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1330.352372] RIP: 0033:0x7f4b0a580986
[ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up at btrfs_lock_and_flush_ordered_range(),
where we find the ordered extent for our write, created by the iomap
callback btrfs_dio_iomap_begin(), and we wait for it to complete, which
makes us deadlock since we can't complete the ordered extent without
reading the pages (the iomap code only submits the bio after the pages
are faulted in).
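
To make the IO pattern concrete, the deadlock can be triggered with
something as simple as the following userspace sketch (in the spirit of
the generic/647 reproducer; the path and sizes are made up and error
handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/test/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
	char *buf;

	ftruncate(fd, 4096);
	/* Map the same file range we are about to do direct IO on. */
	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	/*
	 * Faulting in 'buf' during the direct IO write needs the ordered
	 * extent created by this very write to complete first.
	 */
	pwrite(fd, buf, 4096, 0);
	munmap(buf, 4096);
	close(fd);
	return 0;
}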

Fix this by setting the nofault attribute of the given iov_iter and
retrying the direct IO read/write if we get an -EFAULT error returned
from iomap. For reads, also disable page faults completely; this is
because when we read from a hole or a prealloc extent, we can still
trigger page faults due to the call to iov_iter_zero() done by iomap -
at the moment, it is oblivious to the value of the ->nofault attribute
of an iov_iter. We also need to keep track of the number of bytes
written or read, and pass it to iomap_dio_rw(), as well as use the new
flag IOMAP_DIO_PARTIAL.
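
In condensed form, the retry logic added below for the write path has
the following shape (illustrative only - the function name is invented,
and the locking, sync and buffered fallback details are in the actual
diff; the read path is analogous but uses fault_in_iov_iter_writeable()):

static ssize_t dio_write_with_retry(struct kiocb *iocb, struct iov_iter *from)
{
	size_t prev_left = 0;
	ssize_t written = 0;
	ssize_t err;

again:
	from->nofault = true;
	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
			   IOMAP_DIO_PARTIAL, written);
	from->nofault = false;

	/* iomap adds back the bytes passed in, so this is cumulative. */
	if (err > 0)
		written = err;

	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
		const size_t left = iov_iter_count(from);

		if (left != prev_left) {
			/* We made progress: fault in the pages and retry. */
			fault_in_iov_iter_readable(from, left);
			prev_left = left;
			goto again;
		}
		/* No progress since the last attempt: fall back to buffered. */
	}
	return err < 0 ? err : written;
}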

This depends on the iov_iter and iomap changes done by a recent patchset
from Andreas Gruenbacher, which is not yet merged into Linus' tree as of
this writing. The cover letter has the following subject:

   "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"

The thread can be found at:

https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/

Fixing these issues could be done without the iov_iter and iomap changes
introduced in that patchset, however it would be much more complex, due
to the need to reorder some operations for writes and to pass some state
through nested and deep call chains, which would be particularly
cumbersome for reads - for example, making the readahead and endio
handlers for page reads aware that we are in a direct IO read context
and know which inode and extent range we locked before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V3: Added comments about iomap returning cumulative values for bytes read or
    written.

V2: Updated read path to fall back to buffered IO in case it's taking a long
    time to make any progress, fixing an issue reported by Wang Yugui.
    Rebased on the v8 patchset it depends on and on current misc-next.

 fs/btrfs/file.c | 139 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 123 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 581662d16b72..11204dbbe053 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1912,16 +1912,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
+	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	loff_t pos;
 	ssize_t written = 0;
 	ssize_t written_buffered;
+	size_t prev_left = 0;
 	loff_t endbyte;
 	ssize_t err;
 	unsigned int ilock_flags = 0;
-	struct iomap_dio *dio = NULL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -1964,23 +1965,80 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
-	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0, 0);
+	/*
+	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
+	 * calls generic_write_sync() (through iomap_dio_complete()), because
+	 * that results in calling fsync (btrfs_sync_file()) which will try to
+	 * lock the inode in exclusive/write mode.
+	 */
+	if (is_sync_write)
+		iocb->ki_flags &= ~IOCB_DSYNC;
 
-	btrfs_inode_unlock(inode, ilock_flags);
+	/*
+	 * The iov_iter can be mapped to the same file range we are writing to.
+	 * If that's the case, then we will deadlock in the iomap code, because
+	 * it first calls our callback btrfs_dio_iomap_begin(), which will create
+	 * an ordered extent, and after that it will fault in the pages that the
+	 * iov_iter refers to. During the fault in we end up in the readahead
+	 * pages code (starting at btrfs_readahead()), which will lock the range,
+	 * find that ordered extent and then wait for it to complete (at
+	 * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
+	 * obviously the ordered extent can never complete as we didn't yet
+	 * submit the respective bio(s). This always happens when the buffer is
+	 * memory mapped to the same file range, since the iomap DIO code always
+	 * invalidates pages in the target file range (after starting and waiting
+	 * for any writeback).
+	 *
+	 * So here we disable page faults in the iov_iter and then retry if we
+	 * got -EFAULT, faulting in the pages before the retry.
+	 */
+again:
+	from->nofault = true;
+	err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   IOMAP_DIO_PARTIAL, written);
+	from->nofault = false;
 
-	if (IS_ERR_OR_NULL(dio)) {
-		err = PTR_ERR_OR_ZERO(dio);
-		if (err < 0 && err != -ENOTBLK)
-			goto out;
-	} else {
-		written = iomap_dio_complete(dio);
+	/* No increment (+=) because iomap returns a cumulative value. */
+	if (err > 0)
+		written = err;
+
+	if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
+		const size_t left = iov_iter_count(from);
+		/*
+		 * We have more data left to write. Try to fault in as many of
+		 * the remaining pages as possible and retry. We do this without
+		 * releasing and re-locking the inode, to prevent races with
+		 * truncate.
+		 *
+		 * Also, in case the iov refers to pages in the file range of the
+		 * file we want to write to (due to an mmap), we could enter an
+		 * infinite loop if we retry after faulting the pages in, since
+		 * iomap will invalidate any pages in the range early on, before
+		 * it tries to fault in the pages of the iov. So we keep track of
+		 * how much of the iov was left on the previous EFAULT and fall
+		 * back to buffered IO in case we haven't made any progress.
+		 */
+		if (left == prev_left) {
+			err = -ENOTBLK;
+		} else {
+			fault_in_iov_iter_readable(from, left);
+			prev_left = left;
+			goto again;
+		}
 	}
 
-	if (written < 0 || !iov_iter_count(from)) {
-		err = written;
+	btrfs_inode_unlock(inode, ilock_flags);
+
+	/*
+	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
+	 * the fsync (call generic_write_sync()).
+	 */
+	if (is_sync_write)
+		iocb->ki_flags |= IOCB_DSYNC;
+
+	/* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
+	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
-	}
 
 buffered:
 	pos = iocb->ki_pos;
@@ -2005,7 +2063,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
 				 endbyte >> PAGE_SHIFT);
 out:
-	return written ? written : err;
+	return err < 0 ? err : written;
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -3659,6 +3717,8 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
 static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t prev_left = 0;
+	ssize_t read = 0;
 	ssize_t ret;
 
 	if (fsverity_active(inode))
@@ -3668,10 +3728,57 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
+again:
+	/*
+	 * This is similar to what we do for direct IO writes, see the comment
+	 * at btrfs_direct_write(), but here we also disable page faults at the
+	 * task level, in addition to disabling them at the iov_iter level. This
+	 * is because when reading from a hole or prealloc extent, iomap calls
+	 * iov_iter_zero(), which can still fault in pages despite ->nofault
+	 * being set to true on our 'to' iov_iter.
+	 *
+	 * The difference to direct IO writes is that we deadlock when trying
+	 * to lock the extent range in the inode's tree during the page reads
+	 * triggered by the fault in (while for writes it is due to waiting for
+	 * our own ordered extent). This is because for direct IO reads,
+	 * btrfs_dio_iomap_begin() returns with the extent range locked, which
+	 * is only unlocked in the endio callback (end_bio_extent_readpage()).
+	 */
+	pagefault_disable();
+	to->nofault = true;
 	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			   0, 0);
+			   IOMAP_DIO_PARTIAL, read);
+	to->nofault = false;
+	pagefault_enable();
+
+	/* No increment (+=) because iomap returns a cumulative value. */
+	if (ret > 0)
+		read = ret;
+
+	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+		const size_t left = iov_iter_count(to);
+
+		if (left == prev_left) {
+			/*
+			 * We didn't make any progress since the last attempt, so
+			 * fall back to a buffered read for the remainder of the
+			 * range. This is just to avoid any possibility of looping
+			 * for too long.
+			 */
+			ret = read;
+		} else {
+			/*
+			 * We made some progress since the last retry or this is
+			 * the first time we are retrying. Fault in as many pages
+			 * as possible and retry.
+			 */
+			fault_in_iov_iter_writeable(to, left);
+			prev_left = left;
+			goto again;
+		}
+	}
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
-	return ret;
+	return ret < 0 ? ret : read;
 }
 
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-25 16:27 ` [PATCH v3] " fdmanana
@ 2021-10-25 18:58   ` Josef Bacik
  2021-11-09 11:27   ` Filipe Manana
  1 sibling, 0 replies; 18+ messages in thread
From: Josef Bacik @ 2021-10-25 18:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, wangyugui, Filipe Manana

On Mon, Oct 25, 2021 at 05:27:47PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we do a direct IO read or write when the buffer given by the user is
> memory mapped to the file range we are going to do IO, we end up ending
> in a deadlock. This is triggered by the new test case generic/647 from
> fstests.
> 
> For a direct IO read we get a trace like this:
> 
> [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> [  967.875992] Call Trace:
> [  967.875999]  __schedule+0x3ca/0xe10
> [  967.876015]  schedule+0x43/0xe0
> [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> [  967.876253]  ? lru_cache_add+0x104/0x220
> [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> [  967.876258]  ? sched_clock_cpu+0xd/0x110
> [  967.876263]  ? lock_release+0x155/0x4a0
> [  967.876271]  read_pages+0x86/0x270
> [  967.876274]  ? lru_cache_add+0x125/0x220
> [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> [  967.876291]  filemap_fault+0x626/0xa20
> [  967.876303]  __do_fault+0x36/0xf0
> [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> [  967.876322]  handle_mm_fault+0x9e/0x260
> [  967.876327]  __get_user_pages+0x204/0x620
> [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> [  967.876379]  ? lock_release+0x155/0x4a0
> [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> [  967.876396]  iomap_apply+0x12a/0x4a0
> [  967.876398]  ? iomap_dio_rw+0x30/0x30
> [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> [  967.876415]  ? iomap_dio_rw+0x30/0x30
> [  967.876420]  ? lock_acquired+0xf3/0x420
> [  967.876429]  iomap_dio_rw+0xa/0x30
> [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> [  967.876460]  new_sync_read+0x118/0x1a0
> [  967.876472]  vfs_read+0x128/0x1b0
> [  967.876477]  __x64_sys_pread64+0x90/0xc0
> [  967.876483]  do_syscall_64+0x3b/0xc0
> [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  967.876490] RIP: 0033:0x7fb6f2c038d6
> [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
> 
> This happens because at btrfs_dio_iomap_begin() we lock the extent range
> and return with it locked - we only unlock in the endio callback, at
> end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
> iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
> faults that resulting in reading the pages, through the readahead callback
> btrfs_readahead(), and through there we end to attempt to lock again the
> same extent range (or a subrange of what we locked before), resulting in
> the deadlock.
> 
> For a direct IO write, the scenario is a bit different, and it results in
> trace like this:
> 
> [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> [ 1330.351906] Call Trace:
> [ 1330.351913]  __schedule+0x3ca/0xe10
> [ 1330.351930]  schedule+0x43/0xe0
> [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> [ 1330.352133]  ? lru_cache_add+0x104/0x220
> [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> [ 1330.352143]  ? lock_release+0x155/0x4a0
> [ 1330.352151]  read_pages+0x86/0x270
> [ 1330.352155]  ? lru_cache_add+0x125/0x220
> [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> [ 1330.352172]  filemap_fault+0x626/0xa20
> [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> [ 1330.352184]  __do_fault+0x36/0xf0
> [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> [ 1330.352203]  handle_mm_fault+0x9e/0x260
> [ 1330.352208]  __get_user_pages+0x204/0x620
> [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> [ 1330.352259]  ? lock_release+0x155/0x4a0
> [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> [ 1330.352275]  iomap_apply+0x12a/0x4a0
> [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> [ 1330.352339]  new_sync_write+0x11f/0x1b0
> [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> [ 1330.352354]  vfs_write+0x292/0x3c0
> [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> [ 1330.352365]  do_syscall_64+0x3b/0xc0
> [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 1330.352372] RIP: 0033:0x7f4b0a580986
> [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> range unlocked, but later when the page faults are triggered and we try
> to read the extents, we end up at btrfs_lock_and_flush_ordered_range(), where
> we find the ordered extent for our write, created by the iomap callback
> btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
> deadlock since we can't complete the ordered extent without reading the
> pages (the iomap code only submits the bio after the pages are faulted
> in).
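> 
> In terms of the sketch above, the write case is the same pattern with
> the direct IO write sourced from the mapping of the target range (again
> just an illustration, error checking omitted):
> 
>           /*
>            * Direct IO write whose source buffer is memory mapped to the
>            * same file range: faulting in 'buf' finds the ordered extent
>            * created by btrfs_dio_iomap_begin() for this write and waits
>            * on it forever.
>            */
>           pwrite(fd, buf, len, 0);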
> 
> Fix this by setting the nofault attribute of the given iov_iter and
> retrying the direct IO read/write if we get an -EFAULT error returned
> from iomap. For reads, also disable page faults completely; this is
> because when we read from a hole or a prealloc extent, we can still
> trigger page faults due to the call to iov_iter_zero() done by iomap -
> at the moment, it is oblivious to the value of the ->nofault attribute
> of an iov_iter. We also need to keep track of the number of bytes
> written or read, and pass it to iomap_dio_rw(), as well as use the new
> flag IOMAP_DIO_PARTIAL.
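> 
> For the read side, the resulting retry loop has roughly the following
> shape (a simplified sketch, not the actual patch code - the write side
> is analogous, using fault_in_iov_iter_readable() and without disabling
> page faults around the call):
> 
>           size_t prev_left = 0;
>           ssize_t read = 0;
>           ssize_t ret;
>   again:
>           pagefault_disable();
>           to->nofault = true;
>           /* Pass 'read' so iomap can resume; it returns a cumulative value. */
>           ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
>                              IOMAP_DIO_PARTIAL, read);
>           to->nofault = false;
>           pagefault_enable();
> 
>           if (ret > 0)
>                   read = ret;
> 
>           if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
>                   const size_t left = iov_iter_count(to);
> 
>                   if (left != prev_left) {
>                           /* We made progress: fault in the rest and retry. */
>                           fault_in_iov_iter_writeable(to, left);
>                           prev_left = left;
>                           goto again;
>                   }
>                   /* No progress since the last retry: stop and return 'read'. */
>           }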
> 
> This depends on the iov_iter and iomap changes done by a recent patchset
> from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> moment of this writing. The cover letter has the following subject:
> 
>    "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"
> 
> The thread can be found at:
> 
> https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/
> 
> Fixing these issues could be done without the iov_iter and iomap changes
> introduced in that patchset; however, it would be much more complex, due
> to the need to reorder some operations for writes and to pass state
> through nested and deep call chains, which would be particularly
> cumbersome for reads - for example, making the readahead and endio
> handlers for page reads aware that we are in a direct IO read context
> and aware of which inode and extent range we locked before.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I did my normal review thing with the pre-requisite patches applied; you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-10-25 16:27 ` [PATCH v3] " fdmanana
  2021-10-25 18:58   ` Josef Bacik
@ 2021-11-09 11:27   ` Filipe Manana
  2021-11-09 12:39     ` David Sterba
  1 sibling, 1 reply; 18+ messages in thread
From: Filipe Manana @ 2021-11-09 11:27 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

On Mon, Oct 25, 2021 at 10:25 PM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> If we do a direct IO read or write when the buffer given by the user is
> memory mapped to the file range we are going to do IO on, we end up in
> a deadlock. This is triggered by the new test case generic/647 from
> fstests.
>
> For a direct IO read we get a trace like this:
>
> [  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
> [  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
> [  967.875992] Call Trace:
> [  967.875999]  __schedule+0x3ca/0xe10
> [  967.876015]  schedule+0x43/0xe0
> [  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> [  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
> [  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
> [  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
> [  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
> [  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
> [  967.876253]  ? lru_cache_add+0x104/0x220
> [  967.876255]  ? kvm_sched_clock_read+0x14/0x40
> [  967.876258]  ? sched_clock_cpu+0xd/0x110
> [  967.876263]  ? lock_release+0x155/0x4a0
> [  967.876271]  read_pages+0x86/0x270
> [  967.876274]  ? lru_cache_add+0x125/0x220
> [  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
> [  967.876291]  filemap_fault+0x626/0xa20
> [  967.876303]  __do_fault+0x36/0xf0
> [  967.876308]  __handle_mm_fault+0x83f/0x15f0
> [  967.876322]  handle_mm_fault+0x9e/0x260
> [  967.876327]  __get_user_pages+0x204/0x620
> [  967.876332]  ? get_user_pages_unlocked+0x69/0x340
> [  967.876340]  get_user_pages_unlocked+0xd3/0x340
> [  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
> [  967.876366]  iov_iter_get_pages+0x8d/0x3a0
> [  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
> [  967.876379]  ? lock_release+0x155/0x4a0
> [  967.876387]  iomap_dio_bio_actor+0x232/0x410
> [  967.876396]  iomap_apply+0x12a/0x4a0
> [  967.876398]  ? iomap_dio_rw+0x30/0x30
> [  967.876414]  __iomap_dio_rw+0x29f/0x5e0
> [  967.876415]  ? iomap_dio_rw+0x30/0x30
> [  967.876420]  ? lock_acquired+0xf3/0x420
> [  967.876429]  iomap_dio_rw+0xa/0x30
> [  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
> [  967.876460]  new_sync_read+0x118/0x1a0
> [  967.876472]  vfs_read+0x128/0x1b0
> [  967.876477]  __x64_sys_pread64+0x90/0xc0
> [  967.876483]  do_syscall_64+0x3b/0xc0
> [  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  967.876490] RIP: 0033:0x7fb6f2c038d6
> [  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
> [  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
> [  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
> [  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
> [  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
>
> This happens because at btrfs_dio_iomap_begin() we lock the extent range
> and return with it locked - we only unlock in the endio callback, at
> end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
> iomap has called the btrfs_dio_iomap_begin() callback, it triggers the
> page faults that result in reading the pages, through the readahead
> callback btrfs_readahead(), and from there we end up attempting to lock
> the same extent range again (or a subrange of what we locked before),
> resulting in the deadlock.
>
> For a direct IO write, the scenario is a bit different, and it results in
> a trace like this:
>
> [ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
> [ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
> [ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
> [ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
> [ 1330.351906] Call Trace:
> [ 1330.351913]  __schedule+0x3ca/0xe10
> [ 1330.351930]  schedule+0x43/0xe0
> [ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
> [ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
> [ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
> [ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
> [ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
> [ 1330.352133]  ? lru_cache_add+0x104/0x220
> [ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
> [ 1330.352138]  ? sched_clock_cpu+0xd/0x110
> [ 1330.352143]  ? lock_release+0x155/0x4a0
> [ 1330.352151]  read_pages+0x86/0x270
> [ 1330.352155]  ? lru_cache_add+0x125/0x220
> [ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
> [ 1330.352172]  filemap_fault+0x626/0xa20
> [ 1330.352176]  ? filemap_map_pages+0x18b/0x660
> [ 1330.352184]  __do_fault+0x36/0xf0
> [ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
> [ 1330.352203]  handle_mm_fault+0x9e/0x260
> [ 1330.352208]  __get_user_pages+0x204/0x620
> [ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
> [ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
> [ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
> [ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
> [ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
> [ 1330.352259]  ? lock_release+0x155/0x4a0
> [ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
> [ 1330.352275]  iomap_apply+0x12a/0x4a0
> [ 1330.352278]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
> [ 1330.352294]  ? iomap_dio_rw+0x30/0x30
> [ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
> [ 1330.352339]  new_sync_write+0x11f/0x1b0
> [ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
> [ 1330.352354]  vfs_write+0x292/0x3c0
> [ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
> [ 1330.352365]  do_syscall_64+0x3b/0xc0
> [ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 1330.352372] RIP: 0033:0x7f4b0a580986
> [ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> [ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
> [ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
> [ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
> [ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
> range unlocked, but later when the page faults are triggered and we try
> to read the extents, we end up at btrfs_lock_and_flush_ordered_range(), where
> we find the ordered extent for our write, created by the iomap callback
> btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
> deadlock since we can't complete the ordered extent without reading the
> pages (the iomap code only submits the bio after the pages are faulted
> in).
>
> Fix this by setting the nofault attribute of the given iov_iter and
> retrying the direct IO read/write if we get an -EFAULT error returned
> from iomap. For reads, also disable page faults completely; this is
> because when we read from a hole or a prealloc extent, we can still
> trigger page faults due to the call to iov_iter_zero() done by iomap -
> at the moment, it is oblivious to the value of the ->nofault attribute
> of an iov_iter. We also need to keep track of the number of bytes
> written or read, and pass it to iomap_dio_rw(), as well as use the new
> flag IOMAP_DIO_PARTIAL.
>
> This depends on the iov_iter and iomap changes done by a recent patchset
> from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
> moment of this writing. The cover letter has the following subject:
>
>    "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"
>
> The thread can be found at:
>
> https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/

David, Linus merged that patchset (v9 actually, but without any impact
on this patch) last week, in merge commit
c03098d4b9ad76bca2966a8769dcfe59f7f85103.
Do you want me to update the change log to mention it was already
merged (and which commit)?

Thanks.

>
> Fixing these issues could be done without the iov_iter and iomap changes
> introduced in that patchset; however, it would be much more complex, due
> to the need to reorder some operations for writes and to pass state
> through nested and deep call chains, which would be particularly
> cumbersome for reads - for example, making the readahead and endio
> handlers for page reads aware that we are in a direct IO read context
> and aware of which inode and extent range we locked before.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>
> V3: Added comments about iomap returning cumulative values for bytes read
>     or written.
>
> V2: Updated read path to fall back to buffered IO in case it's taking a
>     long time to make any progress, fixing an issue reported by Wang Yugui.
>     Rebased on the v8 patchset it depends on and on current misc-next.
>
>  fs/btrfs/file.c | 139 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 123 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 581662d16b72..11204dbbe053 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1912,16 +1912,17 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
>
>  static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  {
> +       const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
>         struct file *file = iocb->ki_filp;
>         struct inode *inode = file_inode(file);
>         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>         loff_t pos;
>         ssize_t written = 0;
>         ssize_t written_buffered;
> +       size_t prev_left = 0;
>         loff_t endbyte;
>         ssize_t err;
>         unsigned int ilock_flags = 0;
> -       struct iomap_dio *dio = NULL;
>
>         if (iocb->ki_flags & IOCB_NOWAIT)
>                 ilock_flags |= BTRFS_ILOCK_TRY;
> @@ -1964,23 +1965,80 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>                 goto buffered;
>         }
>
> -       dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -                            0, 0);
> +       /*
> +        * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
> +        * calls generic_write_sync() (through iomap_dio_complete()), because
> +        * that results in calling fsync (btrfs_sync_file()) which will try to
> +        * lock the inode in exclusive/write mode.
> +        */
> +       if (is_sync_write)
> +               iocb->ki_flags &= ~IOCB_DSYNC;
>
> -       btrfs_inode_unlock(inode, ilock_flags);
> +       /*
> +        * The iov_iter can be mapped to the same file range we are writing to.
> +        * If that's the case, then we will deadlock in the iomap code, because
> +        * it first calls our callback btrfs_dio_iomap_begin(), which will create
> +        * an ordered extent, and after that it will fault in the pages that the
> +        * iov_iter refers to. During the fault in we end up in the readahead
> +        * pages code (starting at btrfs_readahead()), which will lock the range,
> +        * find that ordered extent and then wait for it to complete (at
> +        * btrfs_lock_and_flush_ordered_range()), resulting in a deadlock since
> +        * obviously the ordered extent can never complete as we didn't submit
> +        * yet the respective bio(s). This always happens when the buffer is
> +        * memory mapped to the same file range, since the iomap DIO code always
> +        * invalidates pages in the target file range (after starting and waiting
> +        * for any writeback).
> +        *
> +        * So here we disable page faults in the iov_iter and then retry if we
> +        * got -EFAULT, faulting in the pages before the retry.
> +        */
> +again:
> +       from->nofault = true;
> +       err = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +                          IOMAP_DIO_PARTIAL, written);
> +       from->nofault = false;
>
> -       if (IS_ERR_OR_NULL(dio)) {
> -               err = PTR_ERR_OR_ZERO(dio);
> -               if (err < 0 && err != -ENOTBLK)
> -                       goto out;
> -       } else {
> -               written = iomap_dio_complete(dio);
> +       /* No increment (+=) because iomap returns a cumulative value. */
> +       if (err > 0)
> +               written = err;
> +
> +       if (iov_iter_count(from) > 0 && (err == -EFAULT || err > 0)) {
> +               const size_t left = iov_iter_count(from);
> +               /*
> +                * We have more data left to write. Try to fault in as many of
> +                * the remaining pages as possible and retry. We do this without
> +                * releasing and re-locking the inode, to prevent races with
> +                * truncate.
> +                *
> +                * Also, in case the iov refers to pages in the file range of the
> +                * file we want to write to (due to a mmap), we could enter an
> +                * infinite loop if we retry after faulting the pages in, since
> +                * iomap will invalidate any pages in the range early on, before
> +                * it tries to fault in the pages of the iov. So we keep track of
> +                * how much of the iov was left after the previous EFAULT and
> +                * fall back to buffered IO in case we haven't made any progress.
> +                */
> +               if (left == prev_left) {
> +                       err = -ENOTBLK;
> +               } else {
> +                       fault_in_iov_iter_readable(from, left);
> +                       prev_left = left;
> +                       goto again;
> +               }
>         }
>
> -       if (written < 0 || !iov_iter_count(from)) {
> -               err = written;
> +       btrfs_inode_unlock(inode, ilock_flags);
> +
> +       /*
> +        * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
> +        * the fsync (call generic_write_sync()).
> +        */
> +       if (is_sync_write)
> +               iocb->ki_flags |= IOCB_DSYNC;
> +
> +       /* If 'err' is -ENOTBLK then it means we must fall back to buffered IO. */
> +       if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
>                 goto out;
> -       }
>
>  buffered:
>         pos = iocb->ki_pos;
> @@ -2005,7 +2063,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>         invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
>                                  endbyte >> PAGE_SHIFT);
>  out:
> -       return written ? written : err;
> +       return err < 0 ? err : written;
>  }
>
>  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> @@ -3659,6 +3717,8 @@ static int check_direct_read(struct btrfs_fs_info *fs_info,
>  static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  {
>         struct inode *inode = file_inode(iocb->ki_filp);
> +       size_t prev_left = 0;
> +       ssize_t read = 0;
>         ssize_t ret;
>
>         if (fsverity_active(inode))
> @@ -3668,10 +3728,57 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>                 return 0;
>
>         btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> +again:
> +       /*
> +        * This is similar to what we do for direct IO writes, see the comment
> +        * at btrfs_direct_write(), but we also disable page faults in addition
> +        * to disabling them only at the iov_iter level. This is because when
> +        * reading from a hole or prealloc extent, iomap calls iov_iter_zero(),
> +        * which can still trigger page faults despite having set ->nofault
> +        * to true on our 'to' iov_iter.
> +        *
> +        * The difference from direct IO writes is that we deadlock when trying
> +        * to lock the extent range in the inode's tree during the page reads
> +        * triggered by the fault in (while for writes it is due to waiting for
> +        * our own ordered extent). This is because for direct IO reads,
> +        * btrfs_dio_iomap_begin() returns with the extent range locked, which
> +        * is only unlocked in the endio callback (end_bio_extent_readpage()).
> +        */
> +       pagefault_disable();
> +       to->nofault = true;
>         ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -                          0, 0);
> +                          IOMAP_DIO_PARTIAL, read);
> +       to->nofault = false;
> +       pagefault_enable();
> +
> +       /* No increment (+=) because iomap returns a cumulative value. */
> +       if (ret > 0)
> +               read = ret;
> +
> +       if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> +               const size_t left = iov_iter_count(to);
> +
> +               if (left == prev_left) {
> +                       /*
> +                        * We didn't make any progress since the last attempt,
> +                        * fall back to a buffered read for the remainder of the
> +                        * range. This is just to avoid any possibility of looping
> +                        * for too long.
> +                        */
> +                       ret = read;
> +               } else {
> +                       /*
> +                        * We made some progress since the last retry or this is
> +                        * the first time we are retrying. Fault in as many pages
> +                        * as possible and retry.
> +                        */
> +                       fault_in_iov_iter_writeable(to, left);
> +                       prev_left = left;
> +                       goto again;
> +               }
> +       }
>         btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> -       return ret;
> +       return ret < 0 ? ret : read;
>  }
>
>  static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] btrfs: fix deadlock due to page faults during direct IO reads and writes
  2021-11-09 11:27   ` Filipe Manana
@ 2021-11-09 12:39     ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2021-11-09 12:39 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, David Sterba

On Tue, Nov 09, 2021 at 11:27:25AM +0000, Filipe Manana wrote:
> On Mon, Oct 25, 2021 at 10:25 PM <fdmanana@kernel.org> wrote:
> >    "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"
> >
> > The thread can be found at:
> >
> > https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/
> 
> David, Linus merged that patchset (v9 actually, but without any impact
> on this patch) last week, in merge commit
> c03098d4b9ad76bca2966a8769dcfe59f7f85103.
> Do you want me to update the change log to mention it was already
> merged (and which commit)?

Not needed, I'll update the reference and send a pull request based on
the merge. Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-11-09 12:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-08 10:50 [PATCH] btrfs: fix deadlock due to page faults during direct IO reads and writes fdmanana
2021-09-09 19:21 ` Boris Burkov
2021-09-10  8:41   ` Filipe Manana
2021-09-10 16:44     ` Boris Burkov
2021-10-22  5:59 ` Wang Yugui
2021-10-22 10:54   ` Filipe Manana
2021-10-22 12:12     ` Wang Yugui
2021-10-22 13:17       ` Filipe Manana
2021-10-23  3:58         ` Wang Yugui
2021-10-25  9:41           ` Filipe Manana
2021-10-25  9:42 ` [PATCH v2] " fdmanana
2021-10-25 14:42   ` Josef Bacik
2021-10-25 14:54     ` Filipe Manana
2021-10-25 16:11       ` Josef Bacik
2021-10-25 16:27 ` [PATCH v3] " fdmanana
2021-10-25 18:58   ` Josef Bacik
2021-11-09 11:27   ` Filipe Manana
2021-11-09 12:39     ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).