* [PATCHSET 0/7] xfs: random fixes for 5.17
@ 2021-12-16  1:09 Darrick J. Wong
  2021-12-16  1:09 ` [PATCH 1/7] xfs: take the ILOCK when accessing the inode core Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: Chandan Babu R, Ian Kent, linux-xfs

Hi all,

Well, we've reached the end of the year and the only thing that's ready
to go is this pile of bug fixes that I found while I've been quietly
QAing the online fsck code.  There's no particular order to any of
these.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=random-fixes-5.17
---
 fs/xfs/scrub/dir.c        |   15 +++++++++++----
 fs/xfs/scrub/quota.c      |    4 ++--
 fs/xfs/scrub/repair.c     |    3 +++
 fs/xfs/scrub/scrub.c      |    4 ----
 fs/xfs/scrub/scrub.h      |    1 -
 fs/xfs/xfs_buf_item.c     |    2 +-
 fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
 fs/xfs/xfs_iops.c         |   34 +---------------------------------
 fs/xfs/xfs_log.h          |    1 +
 fs/xfs/xfs_log_cil.c      |   25 ++++++++++++++++++++++---
 fs/xfs/xfs_log_recover.c  |   24 +++++++++++++++++++++++-
 fs/xfs/xfs_mount.c        |   10 ----------
 fs/xfs/xfs_qm_syscalls.c  |   11 +----------
 fs/xfs/xfs_reflink.c      |    5 ++++-
 fs/xfs/xfs_super.c        |    9 ---------
 fs/xfs/xfs_symlink.c      |   19 ++++++++++++++-----
 fs/xfs/xfs_trans.c        |   11 ++++++++++-
 17 files changed, 110 insertions(+), 96 deletions(-)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  4:56   ` Dave Chinner
  2021-12-16  1:09 ` [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

I was poking around in the directory code while diagnosing online fsck
bugs, and noticed that xfs_readdir doesn't actually take the directory
ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so
we're protected against writer threads, but we really need to follow the
locking model like we do in other places.  The same applies to the
shortform getdents function.

While we're at it, clean up the somewhat strange structure of this
function.
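The restructured flow can be modelled in a few lines of userspace C (a sketch only; the lock helpers and format flags below are illustrative stand-ins, not the kernel's xfs_ilock_data_map_shared machinery):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch (not kernel code) of the restructured xfs_readdir()
 * flow: decide which getdents variant to run while the mapping lock is
 * held, then drop the lock before doing the actual work.
 */
enum getdents_kind { GD_SHORTFORM, GD_BLOCK, GD_LEAF };

static bool locked;

static void lock_shared(void)   { locked = true; }
static void unlock_shared(void) { locked = false; }

static enum getdents_kind classify(bool is_local, bool is_block)
{
	enum getdents_kind kind;

	lock_shared();
	/* The format and block-vs-leaf checks happen under the lock... */
	if (is_local)
		kind = GD_SHORTFORM;
	else
		kind = is_block ? GD_BLOCK : GD_LEAF;
	unlock_shared();
	/* ...and the caller runs the chosen getdents variant unlocked. */
	return kind;
}
```

The real patch follows the same shape: the ILOCK protects only the fork-format inspection, and each getdents call runs after the lock is dropped.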

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 8310005af00f..25560151c273 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -507,8 +507,9 @@ xfs_readdir(
 	size_t			bufsize)
 {
 	struct xfs_da_args	args = { NULL };
-	int			rval;
-	int			v;
+	unsigned int		lock_mode;
+	int			error;
+	int			isblock;
 
 	trace_xfs_readdir(dp);
 
@@ -522,14 +523,19 @@ xfs_readdir(
 	args.geo = dp->i_mount->m_dir_geo;
 	args.trans = tp;
 
-	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
-		rval = xfs_dir2_sf_getdents(&args, ctx);
-	else if ((rval = xfs_dir2_isblock(&args, &v)))
-		;
-	else if (v)
-		rval = xfs_dir2_block_getdents(&args, ctx);
-	else
-		rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize);
+	lock_mode = xfs_ilock_data_map_shared(dp);
+	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		xfs_iunlock(dp, lock_mode);
+		return xfs_dir2_sf_getdents(&args, ctx);
+	}
 
-	return rval;
+	error = xfs_dir2_isblock(&args, &isblock);
+	xfs_iunlock(dp, lock_mode);
+	if (error)
+		return error;
+
+	if (isblock)
+		return xfs_dir2_block_getdents(&args, ctx);
+
+	return xfs_dir2_leaf_getdents(&args, ctx, bufsize);
 }


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
  2021-12-16  1:09 ` [PATCH 1/7] xfs: take the ILOCK when accessing the inode core Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  4:57   ` Dave Chinner
  2021-12-24  7:16   ` Christoph Hellwig
  2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While debugging some very strange rmap corruption reports in connection
with the online directory repair code, I root-caused the error to the
following incorrect sequence:

<start repair transaction>
<expand directory, causing a deferred rmap to be queued>
<roll transaction>
<cancel transaction>

Obviously, we should have committed the transaction instead of
cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
have warned us that we were throwing away work items that we had already
committed to performing.  Cancelling such a transaction is never
correct, so we also need to shut down the filesystem.

Change xfs_trans_cancel to complain in the loudest manner if we're
cancelling any transaction with deferred work items attached.
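The invariant can be modelled in userspace C (a sketch under assumed names; `model_trans` and its fields are illustrative, not the kernel's struct xfs_trans):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model (not kernel code) of the check this patch adds to
 * xfs_trans_cancel(): cancelling a transaction that still has deferred
 * work items attached is only tolerable if the filesystem has already
 * shut down; otherwise it is a bug worth complaining about loudly.
 */
struct model_trans {
	int	ndefers;	/* deferred work items still attached */
	bool	fs_shutdown;	/* filesystem already shut down? */
};

static bool cancel_should_complain(const struct model_trans *tp)
{
	return tp->ndefers > 0 && !tp->fs_shutdown;
}
```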

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_trans.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 234a9d9c2f43..59e2f9031b9f 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -942,8 +942,17 @@ xfs_trans_cancel(
 
 	trace_xfs_trans_cancel(tp, _RET_IP_);
 
-	if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
+	/*
+	 * It's never valid to cancel a transaction with deferred ops attached,
+	 * because the transaction is effectively dirty.  Complain about this
+	 * loudly before freeing the in-memory defer items.
+	 */
+	if (!list_empty(&tp->t_dfops)) {
+		ASSERT(xfs_is_shutdown(mp) || list_empty(&tp->t_dfops));
+		ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+		dirty = true;
 		xfs_defer_cancel(tp);
+	}
 
 	/*
 	 * See if the caller is relying on us to shut down the



* [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
  2021-12-16  1:09 ` [PATCH 1/7] xfs: take the ILOCK when accessing the inode core Darrick J. Wong
  2021-12-16  1:09 ` [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  5:05   ` Dave Chinner
                     ` (2 more replies)
  2021-12-16  1:09 ` [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When xfs_scrub encounters a directory with a leaf1 block, it tries to
validate that the leaf1 block's bestcount (aka the best free count of
each directory data block) is the correct size.  Previously, this author
believed that comparing bestcount to the directory isize (since
directory data blocks are under isize, and leaf/bestfree blocks are
above it) was sufficient.

Unfortunately during testing of online repair, it was discovered that it
is possible to create a directory with a hole between the last directory
block and isize.  The directory code seems to handle this situation just
fine and xfs_repair doesn't complain, which effectively makes this quirk
part of the disk format.

Fix the check to work properly.
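The arithmetic behind the fix can be sketched in plain C (illustrative only; the directory geometry is reduced to bare block numbers):

```c
#include <assert.h>

/*
 * Sketch (not kernel code) of why the old check misfires.  Directory
 * data blocks occupy [0, last_data_db]; the old code derived the
 * expected bestfree count from isize, which overcounts when a hole sits
 * between the last data block and isize.
 */
static int old_expected_bestcount(int db_under_isize)
{
	return db_under_isize;		/* derived from isize */
}

static int new_expected_bestcount(int last_data_db)
{
	return last_data_db + 1;	/* one slot per data block scanned */
}
```

With data blocks 0..2 and a hole stretching isize to cover five blocks, the old check demands five bestfree slots where only three exist, flagging a healthy directory as corrupt.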

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/dir.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 200a63f58fe7..9a16932d77ce 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -497,6 +497,7 @@ STATIC int
 xchk_directory_leaf1_bestfree(
 	struct xfs_scrub		*sc,
 	struct xfs_da_args		*args,
+	xfs_dir2_db_t			last_data_db,
 	xfs_dablk_t			lblk)
 {
 	struct xfs_dir3_icleaf_hdr	leafhdr;
@@ -534,10 +535,14 @@ xchk_directory_leaf1_bestfree(
 	}
 
 	/*
-	 * There should be as many bestfree slots as there are dir data
-	 * blocks that can fit under i_size.
+	 * There must be enough bestfree slots to cover all the directory data
+	 * blocks that we scanned.  It is possible for there to be a hole
+	 * between the last data block and i_disk_size.  This seems like an
+	 * oversight to the scrub author, but as we have been writing out
+	 * directories like this (and xfs_repair doesn't mind them) for years,
+	 * that's what we have to check.
 	 */
-	if (bestcount != xfs_dir2_byte_to_db(geo, sc->ip->i_disk_size)) {
+	if (bestcount != last_data_db + 1) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
 		goto out;
 	}
@@ -669,6 +674,7 @@ xchk_directory_blocks(
 	xfs_fileoff_t		lblk;
 	struct xfs_iext_cursor	icur;
 	xfs_dablk_t		dabno;
+	xfs_dir2_db_t		last_data_db = 0;
 	bool			found;
 	int			is_block = 0;
 	int			error;
@@ -712,6 +718,7 @@ xchk_directory_blocks(
 				args.geo->fsbcount);
 		     lblk < got.br_startoff + got.br_blockcount;
 		     lblk += args.geo->fsbcount) {
+			last_data_db = xfs_dir2_da_to_db(mp->m_dir_geo, lblk);
 			error = xchk_directory_data_bestfree(sc, lblk,
 					is_block);
 			if (error)
@@ -734,7 +741,7 @@ xchk_directory_blocks(
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
 			goto out;
 		}
-		error = xchk_directory_leaf1_bestfree(sc, &args,
+		error = xchk_directory_leaf1_bestfree(sc, &args, last_data_db,
 				leaf_lblk);
 		if (error)
 			goto out;



* [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  4:36   ` Dave Chinner
  2021-12-16  1:09 ` [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While I was running with KASAN and lockdep enabled, I stumbled upon a
KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
that the original patch to xfs_defer_finish_noroll should have done
something to lock the CIL to prevent it from switching the CIL contexts
while the predicate runs.

For upper level code that needs to know if a given log item is new
enough not to need relogging, add a new wrapper that takes the CIL
context lock long enough to sample the current CIL context.  This is
kind of racy in that the CIL can switch the contexts immediately after
sampling, but that's ok because the consequence is that the defer ops
code is a little slow to relog items.

 ==================================================================
 BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
 Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999

 CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
 Call Trace:
  <TASK>
  dump_stack_lvl+0x45/0x59
  print_address_description.constprop.0+0x1f/0x140
  kasan_report.cold+0x83/0xdf
  xfs_log_item_in_current_chkpt+0x139/0x160
  xfs_defer_finish_noroll+0x3bb/0x1e30
  __xfs_trans_commit+0x6c8/0xcf0
  xfs_reflink_remap_extent+0x66f/0x10e0
  xfs_reflink_remap_blocks+0x2dd/0xa90
  xfs_file_remap_range+0x27b/0xc30
  vfs_dedupe_file_range_one+0x368/0x420
  vfs_dedupe_file_range+0x37c/0x5d0
  do_vfs_ioctl+0x308/0x1260
  __x64_sys_ioctl+0xa1/0x170
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f2c71a2950b
 Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
 RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
 RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
 RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
 R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
  </TASK>

 Allocated by task 464064:
  kasan_save_stack+0x1e/0x50
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0xcd/0x2c0 [xfs]
  xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
  xlog_cil_push_work+0x141/0x13d0 [xfs]
  process_one_work+0x7f6/0x1380
  worker_thread+0x59d/0x1040
  kthread+0x3b0/0x490
  ret_from_fork+0x1f/0x30

 Freed by task 51:
  kasan_save_stack+0x1e/0x50
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  __kasan_slab_free+0xed/0x130
  slab_free_freelist_hook+0x7f/0x160
  kfree+0xde/0x340
  xlog_cil_committed+0xbfd/0xfe0 [xfs]
  xlog_cil_process_committed+0x103/0x1c0 [xfs]
  xlog_state_do_callback+0x45d/0xbd0 [xfs]
  xlog_ioend_work+0x116/0x1c0 [xfs]
  process_one_work+0x7f6/0x1380
  worker_thread+0x59d/0x1040
  kthread+0x3b0/0x490
  ret_from_fork+0x1f/0x30

 Last potentially related work creation:
  kasan_save_stack+0x1e/0x50
  __kasan_record_aux_stack+0xb7/0xc0
  insert_work+0x48/0x2e0
  __queue_work+0x4e7/0xda0
  queue_work_on+0x69/0x80
  xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
  xlog_cil_force_seq+0x1b7/0x850 [xfs]
  xfs_log_force_seq+0x1c7/0x670 [xfs]
  xfs_file_fsync+0x7c1/0xa60 [xfs]
  __x64_sys_fsync+0x52/0x80
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

 The buggy address belongs to the object at ffff88804ea5f600
  which belongs to the cache kmalloc-256 of size 256
 The buggy address is located 8 bytes inside of
  256-byte region [ffff88804ea5f600, ffff88804ea5f700)
 The buggy address belongs to the page:
 page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
 head:ffffea00013a9780 order:1 compound_mapcount:0
 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
 raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
 raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                       ^
  ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================

Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c |    2 +-
 fs/xfs/xfs_log.h      |    1 +
 fs/xfs/xfs_log_cil.c  |   25 ++++++++++++++++++++++---
 3 files changed, 24 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index a7a8e4528881..41a77f4e42ab 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -430,7 +430,7 @@ xfs_buf_item_format(
 	if (bip->bli_flags & XFS_BLI_INODE_BUF) {
 		if (xfs_has_v3inodes(lip->li_mountp) ||
 		    !((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) &&
-		      xfs_log_item_in_current_chkpt(lip)))
+		      __xfs_log_item_in_current_chkpt(lip)))
 			bip->__bli_format.blf_flags |= XFS_BLF_INODE_BUF;
 		bip->bli_flags &= ~XFS_BLI_INODE_BUF;
 	}
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index dc1b77b92fc1..6c7b6981c443 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -132,6 +132,7 @@ struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
 
 void	xlog_cil_process_committed(struct list_head *list);
+bool	__xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
 void	xfs_log_work_queue(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 6c93c8ada6f3..62e84e040642 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1441,10 +1441,10 @@ xlog_cil_force_seq(
  * transaction commit process when deciding what to format into the item.
  */
 bool
-xfs_log_item_in_current_chkpt(
-	struct xfs_log_item *lip)
+__xfs_log_item_in_current_chkpt(
+	struct xfs_log_item	*lip)
 {
-	struct xfs_cil_ctx *ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
+	struct xfs_cil_ctx	*ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
 
 	if (list_empty(&lip->li_cil))
 		return false;
@@ -1457,6 +1457,25 @@ xfs_log_item_in_current_chkpt(
 	return lip->li_seq == ctx->sequence;
 }
 
+/*
+ * Check if the current log item was first committed in this sequence.
+ * This version locks out CIL flushes, but can only be used to gather
+ * a rough estimate of the answer.
+ */
+bool
+xfs_log_item_in_current_chkpt(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_cil		*cil = lip->li_mountp->m_log->l_cilp;
+	bool			ret;
+
+	down_read(&cil->xc_ctx_lock);
+	ret = __xfs_log_item_in_current_chkpt(lip);
+	up_read(&cil->xc_ctx_lock);
+
+	return ret;
+}
+
 /*
  * Perform initial CIL structure initialisation.
  */



* [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-12-16  1:09 ` [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  5:07   ` Dave Chinner
  2021-12-24  7:17   ` Christoph Hellwig
  2021-12-16  1:09 ` [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs Darrick J. Wong
  2021-12-16  1:09 ` [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
  6 siblings, 2 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Prior to commit 40b52225e58c ("xfs: remove support for disabling quota
accounting on a mounted file system"), we used the quotaoff mutex to
protect dquot operations against quotaoff trying to pull down dquots as
part of disabling quota.

Now that we only support turning off quota enforcement, the quotaoff
mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
protect dquots, which means we can remove it from setqlimits and the
dquot scrub code.  While we're at it, fix the function that forces
quotacheck, since it should have been taking the quotaoff mutex.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/quota.c     |    4 ++--
 fs/xfs/scrub/repair.c    |    3 +++
 fs/xfs/scrub/scrub.c     |    4 ----
 fs/xfs/scrub/scrub.h     |    1 -
 fs/xfs/xfs_qm_syscalls.c |   11 +----------
 5 files changed, 6 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index d6c1b00a4fc8..3c7506c7553c 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -48,10 +48,10 @@ xchk_setup_quota(
 	dqtype = xchk_quota_to_dqtype(sc);
 	if (dqtype == 0)
 		return -EINVAL;
-	sc->flags |= XCHK_HAS_QUOTAOFFLOCK;
-	mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock);
+
 	if (!xfs_this_quota_on(sc->mp, dqtype))
 		return -ENOENT;
+
 	error = xchk_setup_fs(sc);
 	if (error)
 		return error;
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 8f3cba14ada3..1e7b6b209ee8 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -25,6 +25,7 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 #include "xfs_quota.h"
+#include "xfs_qm.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -912,11 +913,13 @@ xrep_force_quotacheck(
 	if (!(flag & sc->mp->m_qflags))
 		return;
 
+	mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock);
 	sc->mp->m_qflags &= ~flag;
 	spin_lock(&sc->mp->m_sb_lock);
 	sc->mp->m_sb.sb_qflags &= ~flag;
 	spin_unlock(&sc->mp->m_sb_lock);
 	xfs_log_sb(sc->tp);
+	mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
 }
 
 /*
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 8d528d35b725..b11870d07c56 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -173,10 +173,6 @@ xchk_teardown(
 		mnt_drop_write_file(sc->file);
 	if (sc->flags & XCHK_REAPING_DISABLED)
 		xchk_start_reaping(sc);
-	if (sc->flags & XCHK_HAS_QUOTAOFFLOCK) {
-		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
-		sc->flags &= ~XCHK_HAS_QUOTAOFFLOCK;
-	}
 	if (sc->buf) {
 		kmem_free(sc->buf);
 		sc->buf = NULL;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 80e5026bba44..3de5287e98d8 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -88,7 +88,6 @@ struct xfs_scrub {
 
 /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
-#define XCHK_HAS_QUOTAOFFLOCK	(1 << 1)  /* we hold the quotaoff lock */
 #define XCHK_REAPING_DISABLED	(1 << 2)  /* background block reaping paused */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 47fe60e1a887..7d5a31827681 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -302,13 +302,6 @@ xfs_qm_scall_setqlim(
 	if ((newlim->d_fieldmask & XFS_QC_MASK) == 0)
 		return 0;
 
-	/*
-	 * We don't want to race with a quotaoff so take the quotaoff lock.
-	 * We don't hold an inode lock, so there's nothing else to stop
-	 * a quotaoff from happening.
-	 */
-	mutex_lock(&q->qi_quotaofflock);
-
 	/*
 	 * Get the dquot (locked) before we start, as we need to do a
 	 * transaction to allocate it if it doesn't exist. Once we have the
@@ -319,7 +312,7 @@ xfs_qm_scall_setqlim(
 	error = xfs_qm_dqget(mp, id, type, true, &dqp);
 	if (error) {
 		ASSERT(error != -ENOENT);
-		goto out_unlock;
+		return error;
 	}
 
 	defq = xfs_get_defquota(q, xfs_dquot_type(dqp));
@@ -415,8 +408,6 @@ xfs_qm_scall_setqlim(
 
 out_rele:
 	xfs_qm_dqrele(dqp);
-out_unlock:
-	mutex_unlock(&q->qi_quotaofflock);
 	return error;
 }
 



* [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-12-16  1:09 ` [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  5:11   ` Dave Chinner
  2021-12-24  7:22   ` Christoph Hellwig
  2021-12-16  1:09 ` [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
  6 siblings, 2 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: Ian Kent, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Ian Kent reported that for inline symlinks, it's possible for
vfs_readlink to hang on to the target buffer returned by
_vn_get_link_inline long after it's been freed by xfs inode reclaim.
This is a layering violation -- we should never expose XFS internals to
the VFS.

When the symlink has a remote target, we allocate a separate buffer,
copy the internal information, and let the VFS manage the new buffer's
lifetime.  Let's adapt the inline code paths to do this too.  It's
less efficient, but fixes the layering violation and avoids the need to
adapt the if_data lifetime to rcu rules.  Clearly I don't care about
readlink benchmarks.

As a side note, this fixes the minor locking violation where we can
access the inode data fork without taking any locks; proper locking (and
eliminating the possibility of having to switch inode_operations on a
live inode) is essential to online repair coordinating repairs
correctly.
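The lifetime fix boils down to handing out a caller-owned copy instead of a pointer into internal state, which can be sketched in userspace C (not the kernel implementation; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace sketch (not the kernel implementation) of the fix: instead
 * of exposing a pointer into the inode's inline data fork, copy the
 * symlink target into a buffer the caller owns and may free on its own
 * schedule, independent of inode reclaim.
 */
static char *copy_link_target(const char *if_data, size_t disk_size)
{
	char *link;

	if (!if_data)			/* corrupt inline fork: no target */
		return NULL;

	link = malloc(disk_size + 1);	/* +1 for the NUL terminator */
	if (!link)
		return NULL;
	memcpy(link, if_data, disk_size);
	link[disk_size] = '\0';
	return link;
}
```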

Reported-by: Ian Kent <raven@themaw.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iops.c    |   34 +---------------------------------
 fs/xfs/xfs_symlink.c |   19 ++++++++++++++-----
 2 files changed, 15 insertions(+), 38 deletions(-)


diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a607d6aca5c4..72bdd7c79e93 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -511,27 +511,6 @@ xfs_vn_get_link(
 	return ERR_PTR(error);
 }
 
-STATIC const char *
-xfs_vn_get_link_inline(
-	struct dentry		*dentry,
-	struct inode		*inode,
-	struct delayed_call	*done)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	char			*link;
-
-	ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
-
-	/*
-	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
-	 * if_data is junk.
-	 */
-	link = ip->i_df.if_u1.if_data;
-	if (XFS_IS_CORRUPT(ip->i_mount, !link))
-		return ERR_PTR(-EFSCORRUPTED);
-	return link;
-}
-
 static uint32_t
 xfs_stat_blksize(
 	struct xfs_inode	*ip)
@@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
-static const struct inode_operations xfs_inline_symlink_inode_operations = {
-	.get_link		= xfs_vn_get_link_inline,
-	.getattr		= xfs_vn_getattr,
-	.setattr		= xfs_vn_setattr,
-	.listxattr		= xfs_vn_listxattr,
-	.update_time		= xfs_vn_update_time,
-};
-
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(
@@ -1408,10 +1379,7 @@ xfs_setup_iops(
 		inode->i_fop = &xfs_dir_file_operations;
 		break;
 	case S_IFLNK:
-		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
-			inode->i_op = &xfs_inline_symlink_inode_operations;
-		else
-			inode->i_op = &xfs_symlink_inode_operations;
+		inode->i_op = &xfs_symlink_inode_operations;
 		break;
 	default:
 		inode->i_op = &xfs_inode_operations;
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index fc2c6a404647..4868bd986f52 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -22,6 +22,7 @@
 #include "xfs_trace.h"
 #include "xfs_trans.h"
 #include "xfs_ialloc.h"
+#include "xfs_error.h"
 
 /* ----- Kernel only functions below ----- */
 int
@@ -101,12 +102,10 @@ xfs_readlink(
 {
 	struct xfs_mount *mp = ip->i_mount;
 	xfs_fsize_t	pathlen;
-	int		error = 0;
+	int		error = -EFSCORRUPTED;
 
 	trace_xfs_readlink(ip);
 
-	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_LOCAL);
-
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
@@ -121,12 +120,22 @@ xfs_readlink(
 			 __func__, (unsigned long long) ip->i_ino,
 			 (long long) pathlen);
 		ASSERT(0);
-		error = -EFSCORRUPTED;
 		goto out;
 	}
 
+	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		/*
+		 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
+		 * if_data is junk.
+		 */
+		if (XFS_IS_CORRUPT(ip->i_mount, !ip->i_df.if_u1.if_data))
+			goto out;
 
-	error = xfs_readlink_bmap_ilocked(ip, link);
+		memcpy(link, ip->i_df.if_u1.if_data, ip->i_disk_size + 1);
+		error = 0;
+	} else {
+		error = xfs_readlink_bmap_ilocked(ip, link);
+	}
 
  out:
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);



* [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents
  2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-12-16  1:09 ` [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs Darrick J. Wong
@ 2021-12-16  1:09 ` Darrick J. Wong
  2021-12-16  4:41   ` Dave Chinner
  2021-12-24  7:18   ` Christoph Hellwig
  6 siblings, 2 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16  1:09 UTC (permalink / raw)
  To: djwong; +Cc: Chandan Babu R, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.  While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.
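The ordering invariant can be stated as a tiny userspace model (a sketch; the state names are illustrative, not kernel symbols):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch (not kernel code) of the invariant this patch enforces:
 * leftover CoW staging extents may only be reaped when no live incore
 * CoW reservations can exist, which is guaranteed only at mount time,
 * before any user modifications are permitted.  Running it at the
 * ro->rw transition is unsafe because an earlier rw->ro transition may
 * have left live CoW forks behind.
 */
enum fs_state { FS_MOUNTING, FS_RW, FS_RO_REMOUNT };

static bool safe_to_recover_cow(enum fs_state state)
{
	/* At mount there are no live inodes, hence no live CoW forks. */
	return state == FS_MOUNTING;
}
```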

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/xfs_log_recover.c |   24 +++++++++++++++++++++++-
 fs/xfs/xfs_mount.c       |   10 ----------
 fs/xfs/xfs_reflink.c     |    5 ++++-
 fs/xfs/xfs_super.c       |    9 ---------
 4 files changed, 27 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 53366cc0bc9e..8ecb9a8567b7 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -27,7 +27,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_ag.h"
 #include "xfs_quota.h"
-
+#include "xfs_reflink.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -3498,6 +3498,28 @@ xlog_recover_finish(
 
 	xlog_recover_process_iunlinks(log);
 	xlog_recover_check_summary(log);
+
+	/*
+	 * Recover any CoW staging blocks that are still referenced by the
+	 * ondisk refcount metadata.  During mount there cannot be any live
+	 * staging extents as we have not permitted any user modifications.
+	 * Therefore, it is safe to free them all right now, even on a
+	 * read-only mount.
+	 */
+	error = xfs_reflink_recover_cow(log->l_mp);
+	if (error) {
+		xfs_alert(log->l_mp,
+	"Failed to recover leftover CoW staging extents, err %d.",
+				error);
+		/*
+		 * If we get an error here, make sure the log is shut down
+		 * but return zero so that any log items committed since the
+		 * end of intents processing can be pushed through the CIL
+		 * and AIL.
+		 */
+		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
+	}
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 359109b6f0d3..bed73e8002a5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -936,15 +936,6 @@ xfs_mountfs(
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");
 
-		/* Recover any CoW blocks that never got remapped. */
-		error = xfs_reflink_recover_cow(mp);
-		if (error) {
-			xfs_err(mp,
-	"Error %d recovering leftover CoW allocations.", error);
-			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-			goto out_quota;
-		}
-
 		/* Reserve AG blocks for future btree expansion. */
 		error = xfs_fs_reserve_ag_blocks(mp);
 		if (error && error != -ENOSPC)
@@ -955,7 +946,6 @@ xfs_mountfs(
 
  out_agresv:
 	xfs_fs_unreserve_ag_blocks(mp);
- out_quota:
 	xfs_qm_unmount_quotas(mp);
  out_rtunmount:
 	xfs_rtunmount_inodes(mp);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cb0edb1d68ef..8b6c7163f684 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -749,7 +749,10 @@ xfs_reflink_end_cow(
 }
 
 /*
- * Free leftover CoW reservations that didn't get cleaned out.
+ * Free all CoW staging blocks that are still referenced by the ondisk refcount
+ * metadata.  The ondisk metadata does not track which inode created the
+ * staging extent, so callers must ensure that there are no cached inodes with
+ * live CoW staging extents.
  */
 int
 xfs_reflink_recover_cow(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 778b57b1f020..c7ac486ca5d3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1739,15 +1739,6 @@ xfs_remount_rw(
 	 */
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-
-	/* Recover any CoW blocks that never got remapped. */
-	error = xfs_reflink_recover_cow(mp);
-	if (error) {
-		xfs_err(mp,
-			"Error %d recovering leftover CoW allocations.", error);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		return error;
-	}
 	xfs_blockgc_start(mp);
 
 	/* Create the per-AG metadata reservation pool .*/


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt
  2021-12-16  1:09 ` [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt Darrick J. Wong
@ 2021-12-16  4:36   ` Dave Chinner
  2021-12-16 16:35     ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  4:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:37PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While I was running with KASAN and lockdep enabled, I stumbled upon a
> KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
> comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
> that the original patch to xfs_defer_finish_noroll should have done
> something to lock the CIL to prevent it from switching the CIL contexts
> while the predicate runs.
> 
> For upper level code that needs to know if a given log item is new
> enough not to need relogging, add a new wrapper that takes the CIL
> context lock long enough to sample the current CIL context.  This is
> kind of racy in that the CIL can switch the contexts immediately after
> sampling, but that's ok because the consequence is that the defer ops
> code is a little slow to relog items.
> 

I see the problem, but I don't think this is the right way to fix
it.  The CIL context lock is already a major contention point in the
transaction commit code when it is only taken once per
xfs_trans_commit() call.  If we now potentially take it once per
intent item per xfs_trans_commit() call, we're going to make the
contention even worse than it already is.

The current sequence is always available from the CIL itself via
cil->xc_current_sequence, and we can read that without needing any
locking to provide existence guarantees of the CIL structure.

So....

bool
xfs_log_item_in_current_chkpt(
	struct xfs_log_item *lip)
{
	struct xfs_cil	*cil = lip->li_mountp->m_log->l_cilp;

	if (list_empty(&lip->li_cil))
		return false;

	/*
	 * li_seq is written on the first commit of a log item to record the
	 * first checkpoint it is written to. Hence if it is different to the
	 * current sequence, we're in a new checkpoint.
	 */
	return lip->li_seq == READ_ONCE(cil->xc_current_sequence);
}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
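[Editor's sketch] Dave's lockless version above can be modelled in user space. This is a toy sketch in Python: the `Cil` and `LogItem` classes are invented stand-ins for the kernel structures, and there is no READ_ONCE here. The property being illustrated is that a stale read of the published sequence is harmless; the worst case is that the caller relogs an item it didn't strictly need to.

```python
class Cil:
    """Toy stand-in for struct xfs_cil; only the published sequence."""
    def __init__(self):
        self.xc_current_sequence = 1

class LogItem:
    """Toy stand-in for struct xfs_log_item."""
    def __init__(self):
        self.li_seq = 0      # checkpoint seq recorded on first commit
        self.in_cil = False  # models list_empty(&lip->li_cil)

def log_item_in_current_chkpt(cil, lip):
    # Lockless: one read of the published sequence, no CIL ctx lock.
    if not lip.in_cil:
        return False
    return lip.li_seq == cil.xc_current_sequence

cil = Cil()
lip = LogItem()
assert not log_item_in_current_chkpt(cil, lip)   # never committed

# First commit records the checkpoint sequence on the item.
lip.in_cil = True
lip.li_seq = cil.xc_current_sequence
assert log_item_in_current_chkpt(cil, lip)

# The CIL switches contexts: the item is now in a prior checkpoint,
# so the defer ops code knows it must relog it.
cil.xc_current_sequence += 1
assert not log_item_in_current_chkpt(cil, lip)
```

The design point: because `xc_current_sequence` only ever moves forward, a racing reader can at worst see the old value and answer "yes, current" for an item that just aged out, which merely delays relogging by one pass.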

* Re: [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents
  2021-12-16  1:09 ` [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
@ 2021-12-16  4:41   ` Dave Chinner
  2021-12-24  7:18   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  4:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs

On Wed, Dec 15, 2021 at 05:09:54PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
> 
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
> 
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
> 
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that the
> incore COW fork of inode (A) is now inconsistent with the ondisk
> metadata.  If one of those former COW extents is allocated and mapped
> into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
> 
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  In the first patch, we fixed the race condition in (2) so that
> (A) will always flush the COW fork.  In this second patch, we move the
> _recover_cow call to the initial mount call in (0) for safety.
> 
> As mentioned previously, xfs_reflink_recover_cow walks the refcount
> btree looking for COW staging extents, and frees them.  This was
> intended to be run at mount time (when we know there are no live inodes)
> to clean up any leftover staging events that may have been left behind
> during an unclean shutdown.  As a time "optimization" for readonly
> mounts, we deferred this to the ro->rw transition, not realizing that
> any failure to clean all COW forks during a rw->ro transition would
> result in catastrophic corruption.
> 
> Therefore, remove this optimization and only run the recovery routine
> when we're guaranteed not to have any COW staging extents anywhere,
> which means we always run this at mount time.  While we're at it, move
> the callsite to xfs_log_mount_finish because any refcount btree
> expansion (however unlikely given that we're removing records from the
> right side of the index) must be fed by a per-AG reservation, which
> doesn't exist in its current location.
> 
> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

Looks good now.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
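[Editor's sketch] The corruption sequence quoted above can be sketched as a user-space toy model. Everything here is invented for illustration (a flat array standing in for the refcount btree, `recover_cow` standing in for `xfs_reflink_recover_cow`); the asserts at the end demonstrate the end state the commit message describes: two files referencing one block while the refcount records a single owner.

```python
NBLOCKS = 8
refcount = [0] * NBLOCKS      # toy ondisk refcount btree
staging = [False] * NBLOCKS   # block is a COW staging extent

def recover_cow():
    """Toy xfs_reflink_recover_cow: free every ondisk staging extent,
    without consulting any incore COW forks."""
    for b in range(NBLOCKS):
        if staging[b]:
            staging[b] = False
            refcount[b] = 0

# (1) file A stages block 3 for COW; its incore fork remembers it
a_incore_cow = 3
staging[3] = True
refcount[3] = 1

# (2) rw->ro: blockgc loses the IOLOCK race, so A's incore COW fork
#     survives the remount

# (3) old ro->rw behaviour: free the ondisk staging extents anyway
recover_cow()

# block 3 is reallocated and mapped into file B's data fork
b_data_fork = 3
refcount[3] = 1

# (4) A's stale reservation writes back; the block is transferred to
#     A's data fork without bumping the refcount
a_data_fork = a_incore_cow

# two live owners, but the refcount btree records only one
assert a_data_fork == b_data_fork == 3
assert refcount[3] == 1       # should be 2 -> corruption
```

Running the recovery only at mount time, before any incore COW forks can exist, removes step (3)'s ability to free a reservation that a cached inode still believes it owns.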

* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-16  1:09 ` [PATCH 1/7] xfs: take the ILOCK when accessing the inode core Darrick J. Wong
@ 2021-12-16  4:56   ` Dave Chinner
  2021-12-17 18:59     ` Darrick J. Wong
  2022-01-05  0:09     ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  4:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> I was poking around in the directory code while diagnosing online fsck
> bugs, and noticed that xfs_readdir doesn't actually take the directory
> ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> the data fork mappings

Yup, that is pretty much guaranteed if the inode is in extent or btree
form, as the extent count will be non-zero. Hence we can only get to
the xfs_dir2_isblock() check if the inode has moved from local to
block form between the open and the xfs_dir2_isblock() call in the
getdents code.

> and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> we're protected against writer threads, but we really need to follow the
> locking model like we do in other places.  The same applies to the
> shortform getdents function.

Locking rules should be the same as xfs_dir_lookup().....


> While we're at it, clean up the somewhat strange structure of this
> function.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 8310005af00f..25560151c273 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -507,8 +507,9 @@ xfs_readdir(
>  	size_t			bufsize)
>  {
>  	struct xfs_da_args	args = { NULL };
> -	int			rval;
> -	int			v;
> +	unsigned int		lock_mode;
> +	int			error;
> +	int			isblock;
>  
>  	trace_xfs_readdir(dp);
>  
> @@ -522,14 +523,19 @@ xfs_readdir(
>  	args.geo = dp->i_mount->m_dir_geo;
>  	args.trans = tp;
>  
> -	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> -		rval = xfs_dir2_sf_getdents(&args, ctx);
> -	else if ((rval = xfs_dir2_isblock(&args, &v)))
> -		;
> -	else if (v)
> -		rval = xfs_dir2_block_getdents(&args, ctx);
> -	else
> -		rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> +	lock_mode = xfs_ilock_data_map_shared(dp);
> +	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> +		xfs_iunlock(dp, lock_mode);
> +		return xfs_dir2_sf_getdents(&args, ctx);
> +	}
>  
> -	return rval;
> +	error = xfs_dir2_isblock(&args, &isblock);
> +	xfs_iunlock(dp, lock_mode);
> +	if (error)
> +		return error;
> +
> +	if (isblock)
> +		return xfs_dir2_block_getdents(&args, ctx);
> +
> +	return xfs_dir2_leaf_getdents(&args, ctx, bufsize);

Yeah, nah.

The ILOCK has to be held for xfs_dir2_block_getdents() and
xfs_dir2_leaf_getdents() for the same reason that it needs to be
held for xfs_dir2_isblock(). They both need to do BMBT lookups to
find the physical location of directory blocks in the directory, so
technically need to lock out modifications to the BMBT tree while
they are doing those lookups.

Yup, I know, VFS holds i_rwsem, so directory can't be modified while
xfs_readdir() is running, but if you are going to make one of these
functions have to take the ILOCK, then they all need to. See
xfs_dir_lookup()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
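[Editor's sketch] Dave's point is that the lock must cover every step that walks the BMBT, not just the format check. A toy Python model of that shape (invented names; `threading.Lock` stands in for the ILOCK, and the string return values stand in for the three getdents paths):

```python
import threading

class ToyInode:
    def __init__(self):
        self.ilock = threading.Lock()   # stand-in for the XFS ILOCK
        self.fmt = "local"              # local | block | leaf

def isblock(dp):
    # In XFS this does a BMBT lookup, so it must run under the ILOCK.
    return dp.fmt == "block"

def readdir(dp):
    """Hold the lock across the format check *and* the getdents call
    that depends on the mapping, as xfs_dir_lookup does, instead of
    dropping it between the two."""
    with dp.ilock:                      # xfs_ilock_data_map_shared()
        if dp.fmt == "local":
            return "sf_getdents"
        if isblock(dp):
            return "block_getdents"
        return "leaf_getdents"

dp = ToyInode()
assert readdir(dp) == "sf_getdents"
dp.fmt = "block"
assert readdir(dp) == "block_getdents"
dp.fmt = "leaf"
assert readdir(dp) == "leaf_getdents"
```

In the real code the block/leaf getdents paths also cycle buffer locks, so the kernel version has to pass the lock mode down rather than use a simple scoped lock, but the coverage requirement is the same.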

* Re: [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
  2021-12-16  1:09 ` [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
@ 2021-12-16  4:57   ` Dave Chinner
  2021-12-24  7:16   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  4:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:26PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While debugging some very strange rmap corruption reports in connection
> with the online directory repair code, I root-caused the error to the
> following incorrect sequence:
> 
> <start repair transaction>
> <expand directory, causing a deferred rmap to be queued>
> <roll transaction>
> <cancel transaction>
> 
> Obviously, we should have committed the transaction instead of
> cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
> have warned us that we were throwing away a work item that we already
> committed to performing.  This is not correct, and we need to shut down
> the filesystem.
> 
> Change xfs_trans_cancel to complain in the loudest manner if we're
> cancelling any transaction with deferred work items attached.

Yup, seems reasonable.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
@ 2021-12-16  5:05   ` Dave Chinner
  2021-12-16 19:25     ` Darrick J. Wong
  2021-12-16 22:04   ` Dave Chinner
  2021-12-24  7:17   ` Christoph Hellwig
  2 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  5:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When xfs_scrub encounters a directory with a leaf1 block, it tries to
> validate that the leaf1 block's bestcount (aka the best free count of
> each directory data block) is the correct size.  Previously, this author
> believed that comparing bestcount to the directory isize (since
> directory data blocks are under isize, and leaf/bestfree blocks are
> above it) was sufficient.
> 
> Unfortunately during testing of online repair, it was discovered that it
> is possible to create a directory with a hole between the last directory
> block and isize.

We have xfs_da3_swap_lastblock() that can leave an -empty- da block
between the last referenced block and isize, but that's not a "hole"
in the file. If you don't mean xfs_da3_swap_lastblock(), then can
you clarify what you mean by a "hole" here and explain to me how the
situation it occurs in comes about? 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it
  2021-12-16  1:09 ` [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it Darrick J. Wong
@ 2021-12-16  5:07   ` Dave Chinner
  2021-12-24  7:17   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  5:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:43PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Prior to commit 40b52225e58c ("xfs: remove support for disabling quota
> accounting on a mounted file system"), we used the quotaoff mutex to
> protect dquot operations against quotaoff trying to pull down dquots as
> part of disabling quota.
> 
> Now that we only support turning off quota enforcement, the quotaoff
> mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
> protect dquots, which means we can remove it from setqlimits and the
> dquot scrub code.  While we're at it, fix the function that forces
> quotacheck, since it should have been taking the quotaoff mutex.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs
  2021-12-16  1:09 ` [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs Darrick J. Wong
@ 2021-12-16  5:11   ` Dave Chinner
  2021-12-17  2:58     ` Ian Kent
  2021-12-24  7:22   ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-12-16  5:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Ian Kent, linux-xfs

On Wed, Dec 15, 2021 at 05:09:48PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Ian Kent reported that for inline symlinks, it's possible for
> vfs_readlink to hang on to the target buffer returned by
> _vn_get_link_inline long after it's been freed by xfs inode reclaim.
> This is a layering violation -- we should never expose XFS internals to
> the VFS.
> 
> When the symlink has a remote target, we allocate a separate buffer,
> copy the internal information, and let the VFS manage the new buffer's
> lifetime.  Let's adapt the inline code paths to do this too.  It's
> less efficient, but fixes the layering violation and avoids the need to
> adapt the if_data lifetime to rcu rules.  Clearly I don't care about
> readlink benchmarks.
> 
> As a side note, this fixes the minor locking violation where we can
> access the inode data fork without taking any locks; proper locking (and
> eliminating the possibility of having to switch inode_operations on a
> live inode) is essential to online repair coordinating repairs
> correctly.
> 
> Reported-by: Ian Kent <raven@themaw.net>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks fine, nicely avoids all the nasty RCU interactions trying to
handle this after the fact.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt
  2021-12-16  4:36   ` Dave Chinner
@ 2021-12-16 16:35     ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16 16:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Dec 16, 2021 at 03:36:07PM +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2021 at 05:09:37PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While I was running with KASAN and lockdep enabled, I stumbled upon a
> > KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
> > comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
> > that the original patch to xfs_defer_finish_noroll should have done
> > something to lock the CIL to prevent it from switching the CIL contexts
> > while the predicate runs.
> > 
> > For upper level code that needs to know if a given log item is new
> > enough not to need relogging, add a new wrapper that takes the CIL
> > context lock long enough to sample the current CIL context.  This is
> > kind of racy in that the CIL can switch the contexts immediately after
> > sampling, but that's ok because the consequence is that the defer ops
> > code is a little slow to relog items.
> > 
> 
> I see the problem, but I don't think this is the right way to fix
> it.  The CIL context lock is already a major contention point in the
> transaction commit code when it is only taken once per
> xfs_trans_commit() call.  If we now potentially take it once per
> intent item per xfs_trans_commit() call, we're going to make the
> contention even worse than it already is.
> 
> The current sequence is always available from the CIL itself via
> cil->xc_current_sequence, and we can read that without needing any
> locking to provide existence guarantees of the CIL structure.
> 
> So....
> 
> bool
> xfs_log_item_in_current_chkpt(
> 	struct xfs_log_item *lip)
> {
> 	struct xfs_cil	*cil = lip->li_mountp->m_log->l_cilp;
> 
> 	if (list_empty(&lip->li_cil))
> 		return false;
> 
> 	/*
> 	 * li_seq is written on the first commit of a log item to record the
> 	 * first checkpoint it is written to. Hence if it is different to the
> 	 * current sequence, we're in a new checkpoint.
> 	 */
> 	return lip->li_seq == READ_ONCE(cil->xc_current_sequence);

Ooh, much better!

--D

> }
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16  5:05   ` Dave Chinner
@ 2021-12-16 19:25     ` Darrick J. Wong
  2021-12-16 21:17       ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16 19:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Dec 16, 2021 at 04:05:37PM +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > When xfs_scrub encounters a directory with a leaf1 block, it tries to
> > validate that the leaf1 block's bestcount (aka the best free count of
> > each directory data block) is the correct size.  Previously, this author
> > believed that comparing bestcount to the directory isize (since
> > directory data blocks are under isize, and leaf/bestfree blocks are
> > above it) was sufficient.
> > 
> > Unfortunately during testing of online repair, it was discovered that it
> > is possible to create a directory with a hole between the last directory
> > block and isize.
> 
> We have xfs_da3_swap_lastblock() that can leave an -empty- da block
> between the last referenced block and isize, but that's not a "hole"
> in the file. If you don't mean xfs_da3_swap_lastblock(), then can
> you clarify what you mean by a "hole" here and explain to me how the
> situation it occurs in comes about?

I don't actually know how it comes about.  I wrote a test that sets up
fsstress to expand and contract directories and races xfs_scrub -n, and
noticed that I'd periodically get complaints about directories (usually
$SCRATCH_MNT/p$CPU) where the last block(s) before i_size were actually
holes.

I began reading the dir2 code to try to figure out how this came about
(clearly we're not updating i_size somewhere) but then took the shortcut
of seeing if xfs_repair or xfs_check complained about this situation.
Neither of them did, and I found a couple more directories in a similar
situation on my crash test dummy machine, and concluded "Wellllp, I
guess this is part of the ondisk format!" and committed the patch.

Also, I thought xfs_da3_swap_lastblock only operates on leaf and da
btree blocks, not the blocks containing directory entries?  I /think/
the actual explanation is that something goes wrong in
xfs_dir2_shrink_inode (maybe?) such that the mapping goes away but
i_disk_size doesn't get updated?  Not sure how /that/ can happen,
though...

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16 19:25     ` Darrick J. Wong
@ 2021-12-16 21:17       ` Dave Chinner
  2021-12-16 21:40         ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-12-16 21:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Dec 16, 2021 at 11:25:49AM -0800, Darrick J. Wong wrote:
> On Thu, Dec 16, 2021 at 04:05:37PM +1100, Dave Chinner wrote:
> > On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > When xfs_scrub encounters a directory with a leaf1 block, it tries to
> > > validate that the leaf1 block's bestcount (aka the best free count of
> > > each directory data block) is the correct size.  Previously, this author
> > > believed that comparing bestcount to the directory isize (since
> > > directory data blocks are under isize, and leaf/bestfree blocks are
> > > above it) was sufficient.
> > > 
> > > Unfortunately during testing of online repair, it was discovered that it
> > > is possible to create a directory with a hole between the last directory
> > > block and isize.
> > 
> > We have xfs_da3_swap_lastblock() that can leave an -empty- da block
> > between the last referenced block and isize, but that's not a "hole"
> > in the file. If you don't mean xfs_da3_swap_lastblock(), then can
> > you clarify what you mean by a "hole" here and explain to me how the
> > situation it occurs in comes about?
> 
> I don't actually know how it comes about.  I wrote a test that sets up
> fsstress to expand and contract directories and races xfs_scrub -n, and
> noticed that I'd periodically get complaints about directories (usually
> $SCRATCH_MNT/p$CPU) where the last block(s) before i_size were actually
> holes.

Is that test getting to ENOSPC at all?

> I began reading the dir2 code to try to figure out how this came about
> (clearly we're not updating i_size somewhere) but then took the shortcut
> of seeing if xfs_repair or xfs_check complained about this situation.
> Neither of them did, and I found a couple more directories in a similar
> situation on my crash test dummy machine, and concluded "Wellllp, I
> guess this is part of the ondisk format!" and committed the patch.
> 
> Also, I thought xfs_da3_swap_lastblock only operates on leaf and da
> btree blocks, not the blocks containing directory entries?

Ah, right you are. I noticed xfs_da_shrink_inode() being called from
leaf_to_block() and thought it might be swapping the leaf with the
last data block that we probably just removed. Looking at the code,
that is not going to happen AFAICT...

> I /think/
> the actual explanation is that something goes wrong in
> xfs_dir2_shrink_inode (maybe?) such that the mapping goes away but
> i_disk_size doesn't get updated?  Not sure how /that/ can happen,
> though...

Actually, the ENOSPC case in xfs_dir2_shrink_inode is the likely
case. If we can't free the block because bunmapi gets ENOSPC due
to xfs_dir_rename() being called without a block reservation, it'll
just get left there as an empty data block. If all the other dir
data blocks around it get removed properly, it could eventually end
up between the last valid entry and isize....

There are lots of weird corner cases around ENOSPC in the directory
code, perhaps this is just another of them...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16 21:17       ` Dave Chinner
@ 2021-12-16 21:40         ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-16 21:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Dec 17, 2021 at 08:17:48AM +1100, Dave Chinner wrote:
> On Thu, Dec 16, 2021 at 11:25:49AM -0800, Darrick J. Wong wrote:
> > On Thu, Dec 16, 2021 at 04:05:37PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > When xfs_scrub encounters a directory with a leaf1 block, it tries to
> > > > validate that the leaf1 block's bestcount (aka the best free count of
> > > > each directory data block) is the correct size.  Previously, this author
> > > > believed that comparing bestcount to the directory isize (since
> > > > directory data blocks are under isize, and leaf/bestfree blocks are
> > > > above it) was sufficient.
> > > > 
> > > > Unfortunately during testing of online repair, it was discovered that it
> > > > is possible to create a directory with a hole between the last directory
> > > > block and isize.
> > > 
> > > We have xfs_da3_swap_lastblock() that can leave an -empty- da block
> > > between the last referenced block and isize, but that's not a "hole"
> > > in the file. If you don't mean xfs_da3_swap_lastblock(), then can
> > > you clarify what you mean by a "hole" here and explain to me how the
> > > situation it occurs in comes about?
> > 
> > I don't actually know how it comes about.  I wrote a test that sets up
> > fsstress to expand and contract directories and races xfs_scrub -n, and
> > noticed that I'd periodically get complaints about directories (usually
> > $SCRATCH_MNT/p$CPU) where the last block(s) before i_size were actually
> > holes.
> 
> Is that test getting to ENOSPC at all?

Yes.  That particular VM has a generous 8GB of SCRATCH_DEV to make the
repairs more interesting.

> > I began reading the dir2 code to try to figure out how this came about
> > (clearly we're not updating i_size somewhere) but then took the shortcut
> > of seeing if xfs_repair or xfs_check complained about this situation.
> > Neither of them did, and I found a couple more directories in a similar
> > situation on my crash test dummy machine, and concluded "Wellllp, I
> > guess this is part of the ondisk format!" and committed the patch.
> > 
> > Also, I thought xfs_da3_swap_lastblock only operates on leaf and da
> > btree blocks, not the blocks containing directory entries?
> 
> Ah, right you are. I noticed xfs_da_shrink_inode() being called from
> leaf_to_block() and thought it might be swapping the leaf with the
> last data block that we probably just removed. Looking at the code,
> that is not going to happend AFAICT...
> 
> > I /think/
> > the actual explanation is that something goes wrong in
> > xfs_dir2_shrink_inode (maybe?) such that the mapping goes away but
> > i_disk_size doesn't get updated?  Not sure how /that/ can happen,
> > though...
> 
> Actually, the ENOSPC case in xfs_dir2_shrink_inode is the likely
> case. If we can't free the block because bunmapi gets ENOSPC due
> to xfs_dir_rename() being called without a block reservation, it'll
> just get left there as an empty data block. If all the other dir
> data blocks around it get removed properly, it could eventually end
> up between the last valid entry and isize....
> 
> There are lots of weird corner cases around ENOSPC in the directory
> code, perhaps this is just another of them...

<nod> The next time I reproduce it, I'll send you a metadump.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
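[Editor's sketch] The thread above never pins down the exact mechanism (the scrub report saw unmapped holes, while Dave's ENOSPC theory would leave an empty but still-allocated block), but one plausible shape of the problem can be modelled with a toy sketch. All names here are invented: `shrink_inode` stands in for `xfs_dir2_shrink_inode`, and the directory is just a list of per-block states.

```python
FULL, EMPTY, HOLE = "full", "empty", "hole"

class ToyDir:
    def __init__(self, nblocks):
        self.blocks = [FULL] * nblocks
        self.isize = nblocks          # in units of directory blocks

    def remove_entries(self, b):
        self.blocks[b] = EMPTY

    def shrink_inode(self, b, enospc=False):
        """Free an empty block; on ENOSPC the mapping stays in place
        (the xfs_dir2_shrink_inode corner case Dave describes)."""
        if enospc:
            return False              # empty block left allocated
        self.blocks[b] = HOLE
        if b == self.isize - 1:       # trim isize past trailing holes
            while self.isize and self.blocks[self.isize - 1] == HOLE:
                self.isize -= 1
        return True

d = ToyDir(4)

# block 2 empties out, but bunmapi hits ENOSPC: block left empty
d.remove_entries(2)
d.shrink_inode(2, enospc=True)

# block 3 empties out and is freed normally; isize shrinks to 3
d.remove_entries(3)
d.shrink_inode(3)

# the last block below isize (block 2) now holds no entries, so a
# bestcount check keyed purely off isize can be fooled
assert d.isize == 3
assert d.blocks[d.isize - 1] == EMPTY
```

This is why the scrub fix stops assuming that every block below isize is a populated data block: the on-disk format evidently tolerates this layout, and repair tools accept it.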

* Re: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
  2021-12-16  5:05   ` Dave Chinner
@ 2021-12-16 22:04   ` Dave Chinner
  2021-12-24  7:17   ` Christoph Hellwig
  2 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2021-12-16 22:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:32PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When xfs_scrub encounters a directory with a leaf1 block, it tries to
> validate that the leaf1 block's bestcount (aka the best free count of
> each directory data block) is the correct size.  Previously, this author
> believed that comparing bestcount to the directory isize (since
> directory data blocks are under isize, and leaf/bestfree blocks are
> above it) was sufficient.
> 
> Unfortunately during testing of online repair, it was discovered that it
> is possible to create a directory with a hole between the last directory
> block and isize.  The directory code seems to handle this situation just
> fine and xfs_repair doesn't complain, which effectively makes this quirk
> part of the disk format.
> 
> Fix the check to work properly.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

With the "we're not sure how this happens" discussion out of the
way, the change to handle the empty space between the last block and
isize looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs
  2021-12-16  5:11   ` Dave Chinner
@ 2021-12-17  2:58     ` Ian Kent
  0 siblings, 0 replies; 32+ messages in thread
From: Ian Kent @ 2021-12-17  2:58 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong; +Cc: linux-xfs

On Thu, 2021-12-16 at 16:11 +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2021 at 05:09:48PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Ian Kent reported that for inline symlinks, it's possible for
> > vfs_readlink to hang on to the target buffer returned by
> > xfs_vn_get_link_inline long after it's been freed by xfs inode
> > reclaim.  This is a layering violation -- we should never expose XFS
> > internals to the VFS.
> > 
> > When the symlink has a remote target, we allocate a separate buffer,
> > copy the internal information, and let the VFS manage the new buffer's
> > lifetime.  Let's adapt the inline code paths to do this too.  It's
> > less efficient, but fixes the layering violation and avoids the need
> > to adapt the if_data lifetime to rcu rules.  Clearly I don't care
> > about readlink benchmarks.
> > 
> > As a side note, this fixes the minor locking violation where we can
> > access the inode data fork without taking any locks; proper locking
> > (and eliminating the possibility of having to switch inode_operations
> > on a live inode) is essential to online repair coordinating repairs
> > correctly.
> > 
> > Reported-by: Ian Kent <raven@themaw.net>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> Looks fine, nicely avoids all the nasty RCU interactions trying to
> handle this after the fact.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 

Yes, I like it too and that original rcu gymnastics was always due
to the unclear ownership and lifetime of the link path buffer.

And I don't think needing to switch to ref-walk mode (due to the
memory allocation possibly blocking) is such a big performance
drawback as might be thought.

Acked-by: Ian Kent <raven@themaw.net>




* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-16  4:56   ` Dave Chinner
@ 2021-12-17 18:59     ` Darrick J. Wong
  2021-12-21  1:08       ` Darrick J. Wong
  2022-01-05  0:09     ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-17 18:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > I was poking around in the directory code while diagnosing online fsck
> > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > the data fork mappings
> 
> Yup, that is pretty much guaranteed. If the inode is in extent or btree
> form, the extent count will be non-zero, hence we can only get to the
> xfs_dir2_isblock() check if the inode has moved from local to block
> form between the open and the xfs_dir2_isblock() call in the getdents
> code.
> 
> > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > we're protected against writer threads, but we really need to follow the
> > locking model like we do in other places.  The same applies to the
> > shortform getdents function.
> 
> Locking rules should be the same as xfs_dir_lookup().....
> 
> 
> > While we're at it, clean up the somewhat strange structure of this
> > function.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
> >  1 file changed, 17 insertions(+), 11 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> > index 8310005af00f..25560151c273 100644
> > --- a/fs/xfs/xfs_dir2_readdir.c
> > +++ b/fs/xfs/xfs_dir2_readdir.c
> > @@ -507,8 +507,9 @@ xfs_readdir(
> >  	size_t			bufsize)
> >  {
> >  	struct xfs_da_args	args = { NULL };
> > -	int			rval;
> > -	int			v;
> > +	unsigned int		lock_mode;
> > +	int			error;
> > +	int			isblock;
> >  
> >  	trace_xfs_readdir(dp);
> >  
> > @@ -522,14 +523,19 @@ xfs_readdir(
> >  	args.geo = dp->i_mount->m_dir_geo;
> >  	args.trans = tp;
> >  
> > -	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > -		rval = xfs_dir2_sf_getdents(&args, ctx);
> > -	else if ((rval = xfs_dir2_isblock(&args, &v)))
> > -		;
> > -	else if (v)
> > -		rval = xfs_dir2_block_getdents(&args, ctx);
> > -	else
> > -		rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> > +	lock_mode = xfs_ilock_data_map_shared(dp);
> > +	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > +		xfs_iunlock(dp, lock_mode);
> > +		return xfs_dir2_sf_getdents(&args, ctx);
> > +	}
> >  
> > -	return rval;
> > +	error = xfs_dir2_isblock(&args, &isblock);
> > +	xfs_iunlock(dp, lock_mode);
> > +	if (error)
> > +		return error;
> > +
> > +	if (isblock)
> > +		return xfs_dir2_block_getdents(&args, ctx);
> > +
> > +	return xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> 
> Yeah, nah.
> 
> The ILOCK has to be held for xfs_dir2_block_getdents() and
> xfs_dir2_leaf_getdents() for the same reason that it needs to be
> held for xfs_dir2_isblock(). They both need to do BMBT lookups to
> find the physical location of directory blocks in the directory, so
> technically need to lock out modifications to the BMBT tree while
> they are doing those lookups.
> 
> Yup, I know, VFS holds i_rwsem, so directory can't be modified while
> xfs_readdir() is running, but if you are going to make one of these
> functions have to take the ILOCK, then they all need to. See
> xfs_dir_lookup()....

Hmm.  I thought (and Chandan asked in passing) that the reason that we
keep cycling the directory ILOCK in the block/leaf getdents functions is
because the VFS ->actor functions (aka filldir) directly copy dirents to
userspace and we could trigger a page fault.  The page fault could
trigger memory reclaim, which could in turn route us to writeback with
that ILOCK still held.

Though, thinking about this further, the directory we have ILOCKed
doesn't itself use the page cache, so writeback will never touch it.
So I /think/ it's ok to grab the xfs_ilock_data_map_shared once in
xfs_readdir and hold it all the way to the end of the function?

Or at least I tried it and lockdep didn't complain immediately... :P

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-17 18:59     ` Darrick J. Wong
@ 2021-12-21  1:08       ` Darrick J. Wong
  2021-12-21  5:10         ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2021-12-21  1:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Chandan Babu R

On Fri, Dec 17, 2021 at 10:59:33AM -0800, Darrick J. Wong wrote:
> On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> > On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > I was poking around in the directory code while diagnosing online fsck
> > > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > > the data fork mappings
> > 
> > Yup, that is pretty much guaranteed. If the inode is in extent or btree
> > form, the extent count will be non-zero, hence we can only get to the
> > xfs_dir2_isblock() check if the inode has moved from local to block
> > form between the open and the xfs_dir2_isblock() call in the getdents
> > code.
> > 
> > > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > > we're protected against writer threads, but we really need to follow the
> > > locking model like we do in other places.  The same applies to the
> > > shortform getdents function.
> > 
> > Locking rules should be the same as xfs_dir_lookup().....
> > 
> > 
> > > While we're at it, clean up the somewhat strange structure of this
> > > function.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
> > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> > > index 8310005af00f..25560151c273 100644
> > > --- a/fs/xfs/xfs_dir2_readdir.c
> > > +++ b/fs/xfs/xfs_dir2_readdir.c
> > > @@ -507,8 +507,9 @@ xfs_readdir(
> > >  	size_t			bufsize)
> > >  {
> > >  	struct xfs_da_args	args = { NULL };
> > > -	int			rval;
> > > -	int			v;
> > > +	unsigned int		lock_mode;
> > > +	int			error;
> > > +	int			isblock;
> > >  
> > >  	trace_xfs_readdir(dp);
> > >  
> > > @@ -522,14 +523,19 @@ xfs_readdir(
> > >  	args.geo = dp->i_mount->m_dir_geo;
> > >  	args.trans = tp;
> > >  
> > > -	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > > -		rval = xfs_dir2_sf_getdents(&args, ctx);
> > > -	else if ((rval = xfs_dir2_isblock(&args, &v)))
> > > -		;
> > > -	else if (v)
> > > -		rval = xfs_dir2_block_getdents(&args, ctx);
> > > -	else
> > > -		rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> > > +	lock_mode = xfs_ilock_data_map_shared(dp);
> > > +	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > +		xfs_iunlock(dp, lock_mode);
> > > +		return xfs_dir2_sf_getdents(&args, ctx);
> > > +	}
> > >  
> > > -	return rval;
> > > +	error = xfs_dir2_isblock(&args, &isblock);
> > > +	xfs_iunlock(dp, lock_mode);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	if (isblock)
> > > +		return xfs_dir2_block_getdents(&args, ctx);
> > > +
> > > +	return xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> > 
> > Yeah, nah.
> > 
> > The ILOCK has to be held for xfs_dir2_block_getdents() and
> > xfs_dir2_leaf_getdents() for the same reason that it needs to be
> > held for xfs_dir2_isblock(). They both need to do BMBT lookups to
> > find the physical location of directory blocks in the directory, so
> > technically need to lock out modifications to the BMBT tree while
> > they are doing those lookups.
> > 
> > Yup, I know, VFS holds i_rwsem, so directory can't be modified while
> > xfs_readdir() is running, but if you are going to make one of these
> > functions have to take the ILOCK, then they all need to. See
> > xfs_dir_lookup()....
> 
> Hmm.  I thought (and Chandan asked in passing) that the reason that we
> keep cycling the directory ILOCK in the block/leaf getdents functions is
> because the VFS ->actor functions (aka filldir) directly copy dirents to
> userspace and we could trigger a page fault.  The page fault could
> trigger memory reclaim, which could in turn route us to writeback with
> that ILOCK still held.
> 
> Though, thinking about this further, the directory we have ILOCKed
> doesn't itself use the page cache, so writeback will never touch it.
> So I /think/ it's ok to grab the xfs_ilock_data_map_shared once in
> xfs_readdir and hold it all the way to the end of the function?
> 
> Or at least I tried it and lockdep didn't complain immediately... :P

But lockdep does complain now:

 ======================================================
 WARNING: possible circular locking dependency detected
 5.16.0-rc6-xfsx #rc6 Not tainted
 ------------------------------------------------------
 xfs_scrub/8151 is trying to acquire lock:
 ffff888040abcbe8 (&mm->mmap_lock#2){++++}-{4:4}, at: do_user_addr_fault+0x386/0x600
 
 but task is already holding lock:
 ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 -> #2 (&xfs_dir_ilock_class){++++}-{4:4}:
        down_write_nested+0x41/0x80
        xfs_ilock+0xc9/0x270 [xfs]
        xfs_rename+0x559/0xb80 [xfs]
        xfs_vn_rename+0xdb/0x150 [xfs]
        vfs_rename+0x775/0xa70
        do_renameat2+0x355/0x510
        __x64_sys_renameat2+0x4b/0x60
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
 
 -> #1 (sb_internal){.+.+}-{0:0}:
        xfs_trans_alloc+0x1a8/0x3e0 [xfs]
        xfs_vn_update_time+0xca/0x2a0 [xfs]
        touch_atime+0x17d/0x2b0
        xfs_file_mmap+0xa7/0xb0 [xfs]
        mmap_region+0x3d8/0x600
        do_mmap+0x337/0x4f0
        vm_mmap_pgoff+0xa6/0x150
        ksys_mmap_pgoff+0x16f/0x1c0
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
 
 -> #0 (&mm->mmap_lock#2){++++}-{4:4}:
        __lock_acquire+0x116a/0x1eb0
        lock_acquire+0xc9/0x2f0
        down_read+0x3e/0x50
        do_user_addr_fault+0x386/0x600
        exc_page_fault+0x65/0x250
        asm_exc_page_fault+0x1b/0x20
        filldir64+0xb5/0x1b0
        xfs_dir2_sf_getdents+0x14e/0x370 [xfs]
        xfs_readdir+0x1fd/0x2b0 [xfs]
        iterate_dir+0x142/0x190
        __x64_sys_getdents64+0x7a/0x130
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
 
 other info that might help us debug this:
 
 Chain exists of:
   &mm->mmap_lock#2 --> sb_internal --> &xfs_dir_ilock_class
 
  Possible unsafe locking scenario:
 
        CPU0                    CPU1
        ----                    ----
   lock(&xfs_dir_ilock_class);
                                lock(sb_internal);
                                lock(&xfs_dir_ilock_class);
   lock(&mm->mmap_lock#2);
 
  *** DEADLOCK ***
 
 3 locks held by xfs_scrub/8151:
  #0: ffff88800a8aeaf0 (&f->f_pos_lock){+.+.}-{4:4}, at: __fdget_pos+0x4a/0x60
  #1: ffff8880270b8a08 (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{4:4}, at: iterate_dir+0x3d/0x190
  #2: ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
 
 stack backtrace:
 CPU: 0 PID: 8151 Comm: xfs_scrub Not tainted 5.16.0-rc6-xfsx #rc6 574205e0343df89e2059bf7ee73cf2f2ec847f12
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x45/0x59
  check_noncircular+0xf2/0x110
  __lock_acquire+0x116a/0x1eb0
  lock_acquire+0xc9/0x2f0
  ? do_user_addr_fault+0x386/0x600
  down_read+0x3e/0x50
  ? do_user_addr_fault+0x386/0x600
  do_user_addr_fault+0x386/0x600
  exc_page_fault+0x65/0x250
  asm_exc_page_fault+0x1b/0x20
 RIP: 0010:filldir64+0xb5/0x1b0
 Code: 01 c0 48 29 ca 48 98 48 01 d0 0f 82 9f 00 00 00 48 b9 00 f0 ff ff ff 7f 00 00 48 39 c8 0f 87 8c 00 00 00 0f ae e8 4c 89 6a 08 <4c> 89 36 66 44 89 46 10 44 88 7e 12 48 8d 46 13 48 63 d5 c6 44 16
 RSP: 0018:ffffc900041ebd38 EFLAGS: 00010283
 RAX: 00007f729c004020 RBX: ffff88804616a96b RCX: 00007ffffffff000
 RDX: 00007f729c003fe0 RSI: 00007f729c004000 RDI: ffff88804616a970
 RBP: 0000000000000005 R08: 0000000000000020 R09: 0000000000000000
 R10: 0000000000000002 R11: ffff88804616a96b R12: ffffc900041ebee0
 R13: 0000000000000022 R14: 0000000003009b6b R15: 0000000000000002
  ? filldir64+0x3b/0x1b0
  xfs_dir2_sf_getdents+0x14e/0x370 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
  xfs_readdir+0x1fd/0x2b0 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
  iterate_dir+0x142/0x190
  __x64_sys_getdents64+0x7a/0x130
  ? fillonedir+0x160/0x160
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f72ab7d543b
 Code: 0f 1e fa 48 8b 47 20 c3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 81 fa ff ff ff 7f b8 ff ff ff 7f 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 21 9a 10 00 f7 d8
 RSP: 002b:00007f72a88d6a58 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
 RAX: ffffffffffffffda RBX: 00007f729c003f00 RCX: 00007f72ab7d543b
 RDX: 0000000000008000 RSI: 00007f729c003f00 RDI: 0000000000000006
 RBP: fffffffffffffe00 R08: 0000000000000030 R09: 00007f729c000780
 R10: 00007f729c003c40 R11: 0000000000000293 R12: 00007f729c003ed4
 R13: 0000000000000000 R14: 00007f729c003ed0 R15: 00007f72a0003e10
  </TASK>

IOWs, we have to drop the ILOCK when calling dir_emit because:

1. Rename takes sb_internal (xfs_trans_alloc) and then a directory ILOCK;
2. A pagefault can take the MMAPLOCK and then sb_internal to update the
   file mtime;
3. Now we've made readdir take the directory ILOCK and do something that
   can cause a userspace pagefault.

So with that in mind, can I get a re-review of the original patch?  I'll
add the above to the commit message as a justification for why we can't
just move the ilock/iunlock calls.

--D

> 
> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com


* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-21  1:08       ` Darrick J. Wong
@ 2021-12-21  5:10         ` Dave Chinner
  2022-01-04  1:19           ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2021-12-21  5:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Chandan Babu R

On Mon, Dec 20, 2021 at 05:08:55PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 17, 2021 at 10:59:33AM -0800, Darrick J. Wong wrote:
> > On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > I was poking around in the directory code while diagnosing online fsck
> > > > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > > > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > > > the data fork mappings
> > > 
> > > Yup, that is pretty much guaranteed. If the inode is in extent or btree
> > > form, the extent count will be non-zero, hence we can only get to the
> > > xfs_dir2_isblock() check if the inode has moved from local to block
> > > form between the open and the xfs_dir2_isblock() call in the getdents
> > > code.
> > > 
> > > > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > > > we're protected against writer threads, but we really need to follow the
> > > > locking model like we do in other places.  The same applies to the
> > > > shortform getdents function.
> > > 
> > > Locking rules should be the same as xfs_dir_lookup().....
....
> > > Yup, I know, VFS holds i_rwsem, so directory can't be modified while
> > > xfs_readdir() is running, but if you are going to make one of these
> > > functions have to take the ILOCK, then they all need to. See
> > > xfs_dir_lookup()....
> > 
> > Hmm.  I thought (and Chandan asked in passing) that the reason that we
> > keep cycling the directory ILOCK in the block/leaf getdents functions is
> > because the VFS ->actor functions (aka filldir) directly copy dirents to
> > userspace and we could trigger a page fault.  The page fault could
> > trigger memory reclaim, which could in turn route us to writeback with
> > that ILOCK still held.
> > 
> > Though, thinking about this further, the directory we have ILOCKed
> > doesn't itself use the page cache, so writeback will never touch it.
> > So I /think/ it's ok to grab the xfs_ilock_data_map_shared once in
> > xfs_readdir and hold it all the way to the end of the function?
> > 
> > Or at least I tried it and lockdep didn't complain immediately... :P
> 
> But lockdep does complain now:
> 
>  ======================================================
>  WARNING: possible circular locking dependency detected
>  5.16.0-rc6-xfsx #rc6 Not tainted
>  ------------------------------------------------------
>  xfs_scrub/8151 is trying to acquire lock:
>  ffff888040abcbe8 (&mm->mmap_lock#2){++++}-{4:4}, at: do_user_addr_fault+0x386/0x600
>  
>  but task is already holding lock:
>  ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
>  
>  which lock already depends on the new lock.
>  
>  
>  the existing dependency chain (in reverse order) is:
>  
>  -> #2 (&xfs_dir_ilock_class){++++}-{4:4}:
>         down_write_nested+0x41/0x80
>         xfs_ilock+0xc9/0x270 [xfs]
>         xfs_rename+0x559/0xb80 [xfs]
>         xfs_vn_rename+0xdb/0x150 [xfs]
>         vfs_rename+0x775/0xa70
>         do_renameat2+0x355/0x510
>         __x64_sys_renameat2+0x4b/0x60
>         do_syscall_64+0x35/0x80
>         entry_SYSCALL_64_after_hwframe+0x44/0xae
>  
>  -> #1 (sb_internal){.+.+}-{0:0}:
>         xfs_trans_alloc+0x1a8/0x3e0 [xfs]
>         xfs_vn_update_time+0xca/0x2a0 [xfs]
>         touch_atime+0x17d/0x2b0
>         xfs_file_mmap+0xa7/0xb0 [xfs]
>         mmap_region+0x3d8/0x600
>         do_mmap+0x337/0x4f0
>         vm_mmap_pgoff+0xa6/0x150
>         ksys_mmap_pgoff+0x16f/0x1c0
>         do_syscall_64+0x35/0x80
>         entry_SYSCALL_64_after_hwframe+0x44/0xae

IDGI. That's a mmap() syscall, not a page fault. You can't mmap() a
directory inode, so this has to be a regular file inode...

>  -> #0 (&mm->mmap_lock#2){++++}-{4:4}:
>         __lock_acquire+0x116a/0x1eb0
>         lock_acquire+0xc9/0x2f0
>         down_read+0x3e/0x50
>         do_user_addr_fault+0x386/0x600
>         exc_page_fault+0x65/0x250
>         asm_exc_page_fault+0x1b/0x20
>         filldir64+0xb5/0x1b0
>         xfs_dir2_sf_getdents+0x14e/0x370 [xfs]
>         xfs_readdir+0x1fd/0x2b0 [xfs]
>         iterate_dir+0x142/0x190
>         __x64_sys_getdents64+0x7a/0x130
>         do_syscall_64+0x35/0x80
>         entry_SYSCALL_64_after_hwframe+0x44/0xae
>  
>  other info that might help us debug this:
>  
>  Chain exists of:
>    &mm->mmap_lock#2 --> sb_internal --> &xfs_dir_ilock_class
>  
>   Possible unsafe locking scenario:
>  
>         CPU0                    CPU1
>         ----                    ----
>    lock(&xfs_dir_ilock_class);
>                                 lock(sb_internal);
>                                 lock(&xfs_dir_ilock_class);
>    lock(&mm->mmap_lock#2);
>  
>   *** DEADLOCK ***
>  
>  3 locks held by xfs_scrub/8151:
>   #0: ffff88800a8aeaf0 (&f->f_pos_lock){+.+.}-{4:4}, at: __fdget_pos+0x4a/0x60
>   #1: ffff8880270b8a08 (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{4:4}, at: iterate_dir+0x3d/0x190
>   #2: ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
>  
>  stack backtrace:
>  CPU: 0 PID: 8151 Comm: xfs_scrub Not tainted 5.16.0-rc6-xfsx #rc6 574205e0343df89e2059bf7ee73cf2f2ec847f12
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x45/0x59
>   check_noncircular+0xf2/0x110
>   __lock_acquire+0x116a/0x1eb0
>   lock_acquire+0xc9/0x2f0
>   ? do_user_addr_fault+0x386/0x600
>   down_read+0x3e/0x50
>   ? do_user_addr_fault+0x386/0x600
>   do_user_addr_fault+0x386/0x600
>   exc_page_fault+0x65/0x250
>   asm_exc_page_fault+0x1b/0x20
>  RIP: 0010:filldir64+0xb5/0x1b0
>  Code: 01 c0 48 29 ca 48 98 48 01 d0 0f 82 9f 00 00 00 48 b9 00 f0 ff ff ff 7f 00 00 48 39 c8 0f 87 8c 00 00 00 0f ae e8 4c 89 6a 08 <4c> 89 36 66 44 89 46 10 44 88 7e 12 48 8d 46 13 48 63 d5 c6 44 16
>  RSP: 0018:ffffc900041ebd38 EFLAGS: 00010283
>  RAX: 00007f729c004020 RBX: ffff88804616a96b RCX: 00007ffffffff000
>  RDX: 00007f729c003fe0 RSI: 00007f729c004000 RDI: ffff88804616a970
>  RBP: 0000000000000005 R08: 0000000000000020 R09: 0000000000000000
>  R10: 0000000000000002 R11: ffff88804616a96b R12: ffffc900041ebee0
>  R13: 0000000000000022 R14: 0000000003009b6b R15: 0000000000000002
>   ? filldir64+0x3b/0x1b0
>   xfs_dir2_sf_getdents+0x14e/0x370 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
>   xfs_readdir+0x1fd/0x2b0 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
>   iterate_dir+0x142/0x190
>   __x64_sys_getdents64+0x7a/0x130
>   ? fillonedir+0x160/0x160
>   do_syscall_64+0x35/0x80
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>  RIP: 0033:0x7f72ab7d543b
>  Code: 0f 1e fa 48 8b 47 20 c3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 81 fa ff ff ff 7f b8 ff ff ff 7f 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 21 9a 10 00 f7 d8
>  RSP: 002b:00007f72a88d6a58 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
>  RAX: ffffffffffffffda RBX: 00007f729c003f00 RCX: 00007f72ab7d543b
>  RDX: 0000000000008000 RSI: 00007f729c003f00 RDI: 0000000000000006
>  RBP: fffffffffffffe00 R08: 0000000000000030 R09: 00007f729c000780
>  R10: 00007f729c003c40 R11: 0000000000000293 R12: 00007f729c003ed4
>  R13: 0000000000000000 R14: 00007f729c003ed0 R15: 00007f72a0003e10
>   </TASK>
> 
> IOWs, we have to drop the ILOCK when calling dir_emit because:
> 
> 1. Rename takes sb_internal (xfs_trans_alloc) and then a directory ILOCK;
> 2. A pagefault can take the MMAPLOCK and then sb_internal to update the
>    file mtime;

Ok, let's assume that the lockdep report is actually a page fault
rather than a completely independent mmap() syscall.

A page fault on a mmap()d data buffer won't get this far - if
there is a freeze in progress it will get stuck on SB_PAGEFAULT
in __xfs_filemap_fault() before it updates the mtime.

Hence, AFAICT, if we have an inode stuck there waiting for a freeze
to make progress in a page fault, it means the readdir holds the
directory i_rwsem (IOLOCK) in read mode.

If this is the case, then the rename() syscall cannot get past
vfs_rename->lock_rename() as that will block trying to get the
directory i_rwsem in write mode that the readdir already holds.

And if we have the opposite, where we are in xfs_rename() waiting
for XFS_ILOCK_EXCL on the directory inodes, it means that the VFS is
holding the i_rwsem in write mode on the directory and hence readdir
gets locked out.

i.e. the VFS level i_rwsem locking prevents xfs_readdir() and
xfs_rename() from being called on the same directory at the same
time, and so the nested page fault recursion scenario indicated here
does not seem possible.

> 3. Now we've made readdir take the directory ILOCK and do something that
>    can cause a userspace pagefault.

Yup, and while that fault on the regular file is being handled,
the rename cannot get past the VFS because the directory
IOLOCK/i_rwsem is held...

> So with that in mind, can I get a re-review of the original patch?  I'll
> add the above to the commit message as a justification for why we can't
> just move the ilock/iunlock calls.

On the above, I think it's a false positive. Can you go through the
analysis again and check that I haven't missed a case where the VFS
allows concurrent read and write access to a single directory?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
  2021-12-16  1:09 ` [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
  2021-12-16  4:57   ` Dave Chinner
@ 2021-12-24  7:16   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2021-12-24  7:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 15, 2021 at 05:09:26PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While debugging some very strange rmap corruption reports in connection
> with the online directory repair code, I root-caused the error to the
> following incorrect sequence:
> 
> <start repair transaction>
> <expand directory, causing a deferred rmap to be queued>
> <roll transaction>
> <cancel transaction>
> 
> Obviously, we should have committed the transaction instead of
> cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
> have warned us that we were throwing away a work item that we already
> committed to performing.  This is not correct, and we need to shut down
> the filesystem.
> 
> Change xfs_trans_cancel to complain in the loudest manner if we're
> cancelling any transaction with deferred work items attached.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check
  2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
  2021-12-16  5:05   ` Dave Chinner
  2021-12-16 22:04   ` Dave Chinner
@ 2021-12-24  7:17   ` Christoph Hellwig
  2 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2021-12-24  7:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it
  2021-12-16  1:09 ` [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it Darrick J. Wong
  2021-12-16  5:07   ` Dave Chinner
@ 2021-12-24  7:17   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2021-12-24  7:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents
  2021-12-16  1:09 ` [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
  2021-12-16  4:41   ` Dave Chinner
@ 2021-12-24  7:18   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2021-12-24  7:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs
  2021-12-16  1:09 ` [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs Darrick J. Wong
  2021-12-16  5:11   ` Dave Chinner
@ 2021-12-24  7:22   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2021-12-24  7:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Ian Kent, linux-xfs

On Wed, Dec 15, 2021 at 05:09:48PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Ian Kent reported that for inline symlinks, it's possible for
> vfs_readlink to hang on to the target buffer returned by
> xfs_vn_get_link_inline long after it's been freed by xfs inode reclaim.
> This is a layering violation -- we should never expose XFS internals to
> the VFS.
> 
> When the symlink has a remote target, we allocate a separate buffer,
> copy the internal information, and let the VFS manage the new buffer's
> lifetime.  Let's adapt the inline code paths to do this too.  It's
> less efficient, but fixes the layering violation and avoids the need to
> adapt the if_data lifetime to rcu rules.  Clearly I don't care about
> readlink benchmarks.
> 
> As a side note, this fixes the minor locking violation where we can
> access the inode data fork without taking any locks; proper locking (and
> eliminating the possibility of having to switch inode_operations on a
> live inode) is essential to online repair coordinating repairs
> correctly.
> 
> Reported-by: Ian Kent <raven@themaw.net>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

This should also allow us to avoid the magic overallocation in
xfs_init_local_fork.


* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-21  5:10         ` Dave Chinner
@ 2022-01-04  1:19           ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-01-04  1:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Chandan Babu R

On Tue, Dec 21, 2021 at 04:10:04PM +1100, Dave Chinner wrote:
> On Mon, Dec 20, 2021 at 05:08:55PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 17, 2021 at 10:59:33AM -0800, Darrick J. Wong wrote:
> > > On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > I was poking around in the directory code while diagnosing online fsck
> > > > > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > > > > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > > > > the data fork mappings
> > > > 
> > > > Yup, that is pretty much guaranteed. If the inode is in extent or
> > > > btree form, the extent count will be non-zero, hence we can only
> > > > get to the xfs_dir2_isblock() check if the inode has moved from
> > > > local to block form between the open and the xfs_dir2_isblock()
> > > > call in the getdents code.
> > > > 
> > > > > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > > > > we're protected against writer threads, but we really need to follow the
> > > > > locking model like we do in other places.  The same applies to the
> > > > > shortform getdents function.
> > > > 
> > > > Locking rules should be the same as xfs_dir_lookup().....
> ....
> > > > Yup, I know, VFS holds i_rwsem, so directory can't be modified while
> > > > xfs_readdir() is running, but if you are going to make one of these
> > > > functions have to take the ILOCK, then they all need to. See
> > > > xfs_dir_lookup()....
> > > 
> > > Hmm.  I thought (and Chandan asked in passing) that the reason that we
> > > keep cycling the directory ILOCK in the block/leaf getdents functions is
> > > because the VFS ->actor functions (aka filldir) directly copy dirents to
> > > userspace and we could trigger a page fault.  The page fault could
> > > trigger memory reclaim, which could in turn route us to writeback with
> > > that ILOCK still held.
> > > 
> > > Though, thinking about this further, the directory we have ILOCKed
> > > doesn't itself use the page cache, so writeback will never touch it.
> > > So I /think/ it's ok to grab the xfs_ilock_data_map_shared once in
> > > xfs_readdir and hold it all the way to the end of the function?
> > > 
> > > Or at least I tried it and lockdep didn't complain immediately... :P
> > 
> > But lockdep does complain now:
> > 
> >  ======================================================
> >  WARNING: possible circular locking dependency detected
> >  5.16.0-rc6-xfsx #rc6 Not tainted
> >  ------------------------------------------------------
> >  xfs_scrub/8151 is trying to acquire lock:
> >  ffff888040abcbe8 (&mm->mmap_lock#2){++++}-{4:4}, at: do_user_addr_fault+0x386/0x600
> >  
> >  but task is already holding lock:
> >  ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
> >  
> >  which lock already depends on the new lock.
> >  
> >  
> >  the existing dependency chain (in reverse order) is:
> >  
> >  -> #2 (&xfs_dir_ilock_class){++++}-{4:4}:
> >         down_write_nested+0x41/0x80
> >         xfs_ilock+0xc9/0x270 [xfs]
> >         xfs_rename+0x559/0xb80 [xfs]
> >         xfs_vn_rename+0xdb/0x150 [xfs]
> >         vfs_rename+0x775/0xa70
> >         do_renameat2+0x355/0x510
> >         __x64_sys_renameat2+0x4b/0x60
> >         do_syscall_64+0x35/0x80
> >         entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  
> >  -> #1 (sb_internal){.+.+}-{0:0}:
> >         xfs_trans_alloc+0x1a8/0x3e0 [xfs]
> >         xfs_vn_update_time+0xca/0x2a0 [xfs]
> >         touch_atime+0x17d/0x2b0
> >         xfs_file_mmap+0xa7/0xb0 [xfs]
> >         mmap_region+0x3d8/0x600
> >         do_mmap+0x337/0x4f0
> >         vm_mmap_pgoff+0xa6/0x150
> >         ksys_mmap_pgoff+0x16f/0x1c0
> >         do_syscall_64+0x35/0x80
> >         entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> IDGI. That's a mmap() syscall, not a page fault. You can't mmap() a
> directory inode, so this has to be a regular file inode...
> 
> >  -> #0 (&mm->mmap_lock#2){++++}-{4:4}:
> >         __lock_acquire+0x116a/0x1eb0
> >         lock_acquire+0xc9/0x2f0
> >         down_read+0x3e/0x50
> >         do_user_addr_fault+0x386/0x600
> >         exc_page_fault+0x65/0x250
> >         asm_exc_page_fault+0x1b/0x20
> >         filldir64+0xb5/0x1b0
> >         xfs_dir2_sf_getdents+0x14e/0x370 [xfs]
> >         xfs_readdir+0x1fd/0x2b0 [xfs]
> >         iterate_dir+0x142/0x190
> >         __x64_sys_getdents64+0x7a/0x130
> >         do_syscall_64+0x35/0x80
> >         entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  
> >  other info that might help us debug this:
> >  
> >  Chain exists of:
> >    &mm->mmap_lock#2 --> sb_internal --> &xfs_dir_ilock_class
> >  
> >   Possible unsafe locking scenario:
> >  
> >         CPU0                    CPU1
> >         ----                    ----
> >    lock(&xfs_dir_ilock_class);
> >                                 lock(sb_internal);
> >                                 lock(&xfs_dir_ilock_class);
> >    lock(&mm->mmap_lock#2);
> >  
> >   *** DEADLOCK ***
> >  
> >  3 locks held by xfs_scrub/8151:
> >   #0: ffff88800a8aeaf0 (&f->f_pos_lock){+.+.}-{4:4}, at: __fdget_pos+0x4a/0x60
> >   #1: ffff8880270b8a08 (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{4:4}, at: iterate_dir+0x3d/0x190
> >   #2: ffff8880270b87e8 (&xfs_dir_ilock_class){++++}-{4:4}, at: xfs_ilock_data_map_shared+0x2a/0x30 [xfs]
> >  
> >  stack backtrace:
> >  CPU: 0 PID: 8151 Comm: xfs_scrub Not tainted 5.16.0-rc6-xfsx #rc6 574205e0343df89e2059bf7ee73cf2f2ec847f12
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> >  Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x45/0x59
> >   check_noncircular+0xf2/0x110
> >   __lock_acquire+0x116a/0x1eb0
> >   lock_acquire+0xc9/0x2f0
> >   ? do_user_addr_fault+0x386/0x600
> >   down_read+0x3e/0x50
> >   ? do_user_addr_fault+0x386/0x600
> >   do_user_addr_fault+0x386/0x600
> >   exc_page_fault+0x65/0x250
> >   asm_exc_page_fault+0x1b/0x20
> >  RIP: 0010:filldir64+0xb5/0x1b0
> >  Code: 01 c0 48 29 ca 48 98 48 01 d0 0f 82 9f 00 00 00 48 b9 00 f0 ff ff ff 7f 00 00 48 39 c8 0f 87 8c 00 00 00 0f ae e8 4c 89 6a 08 <4c> 89 36 66 44 89 46 10 44 88 7e 12 48 8d 46 13 48 63 d5 c6 44 16
> >  RSP: 0018:ffffc900041ebd38 EFLAGS: 00010283
> >  RAX: 00007f729c004020 RBX: ffff88804616a96b RCX: 00007ffffffff000
> >  RDX: 00007f729c003fe0 RSI: 00007f729c004000 RDI: ffff88804616a970
> >  RBP: 0000000000000005 R08: 0000000000000020 R09: 0000000000000000
> >  R10: 0000000000000002 R11: ffff88804616a96b R12: ffffc900041ebee0
> >  R13: 0000000000000022 R14: 0000000003009b6b R15: 0000000000000002
> >   ? filldir64+0x3b/0x1b0
> >   xfs_dir2_sf_getdents+0x14e/0x370 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
> >   xfs_readdir+0x1fd/0x2b0 [xfs 802a19c6d5ac0a8a2cd22c73d30f7cd9e92f7194]
> >   iterate_dir+0x142/0x190
> >   __x64_sys_getdents64+0x7a/0x130
> >   ? fillonedir+0x160/0x160
> >   do_syscall_64+0x35/0x80
> >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  RIP: 0033:0x7f72ab7d543b
> >  Code: 0f 1e fa 48 8b 47 20 c3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 81 fa ff ff ff 7f b8 ff ff ff 7f 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 21 9a 10 00 f7 d8
> >  RSP: 002b:00007f72a88d6a58 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
> >  RAX: ffffffffffffffda RBX: 00007f729c003f00 RCX: 00007f72ab7d543b
> >  RDX: 0000000000008000 RSI: 00007f729c003f00 RDI: 0000000000000006
> >  RBP: fffffffffffffe00 R08: 0000000000000030 R09: 00007f729c000780
> >  R10: 00007f729c003c40 R11: 0000000000000293 R12: 00007f729c003ed4
> >  R13: 0000000000000000 R14: 00007f729c003ed0 R15: 00007f72a0003e10
> >   </TASK>
> > 
> > IOWs, we have to drop the ILOCK when calling dir_emit because:
> > 
> > 1. Rename takes sb_internal (xfs_trans_alloc) and then a directory ILOCK;
> > 2. A pagefault can take the MMAPLOCK and then sb_internal to update the
> >    file mtime;
> 
> Ok, let's assume that the lockdep report is actually a page fault
> rather than a completely independent mmap() syscall.
> 
> A page fault on a mmap()d data buffer that won't get this far - if
> there is a freeze in progress it will get stuck on on SB_PAGEFAULT
> in __xfs_filemap_fault() before it updates the mtime.
> 
> Hence, AFAICT, if we have an inode stuck there waiting for a freeze
> to make progress in a page fault, it means the readdir holds the
> directory i_rwsem (IOLOCK) in read mode.
> 
> If this is the case, then the rename() syscall cannot get past
> vfs_rename->lock_rename() as that will block trying to get the
> directory i_rwsem in write mode that the readdir already holds.
> 
> And if we have the opposite, where we are in xfs_rename() waiting
> for XFS_ILOCK_EXCL on the directory inodes, it means that the VFS is
> holding the i_rwsem in write mode on the directory and hence readdir
> gets locked out.
> 
> i.e. the vfs level i_rwsem locking prevents xfs_readdir() and
> xfs_rename() being called on the same directory at the same time
> and so the nested page fault recursion scenario indicated here does
> not seem possible.

Hmm, I think I agree that we can't have sb_internal->dir_ilock at the
same time as dir_ilock->mmap_lock->sb_internal because the first will
want IOLOCK_EXCL and the second will want IOLOCK_SHARED.  IOWs, I agree
that it's a false positive, and I couldn't find anywhere in the VFS that
allows mixed read and write access to a directory.  The directory mod
operations (link/unlink/create/rename) all seem to inode_lock(), as does
setattr and the xattr functions.  The only code paths that I could find
that use inode_lock_shared are readdir and certain lookups.

(Why didn't lockdep show i_rwsem as part of this chain?)

However, I also think this breaks my mental model of lock acquisition
order in XFS.

The chain, as presented, is this:

> >    &mm->mmap_lock#2 --> sb_internal --> &xfs_dir_ilock_class

But let me rearrange that to cover what I think can happen in the
readdir case with this patch applied:

IOLOCK_SHARED         // directory
ILOCK_SHARED          // also directory

mmap_lock             // file
sb_pagefault          // __xfs_filemap_fault
MMAPLOCK_SHARED       // file
sb_intwrite           // xfs_iomap_write_direct
ILOCK_EXCL            // file

Up until now, I've always written code to acquire resources in the order
IOLOCK -> MMAPLOCK -> transaction -> ILOCK, and I think everyone else
has done that too.  Allowing this new readdir behavior is awkward
because now there /is/ a place where we can try to allocate a
transaction while holding an ILOCK.

Call me a bureaucrat, but I'd rather keep the bright line rule that we
don't take sb_internal with an ILOCK held than deal with more locking
model nuance. :)

--D

> > 3. Now we've made readdir take the directory ILOCK and do something that
> >    can cause a userspace pagefault.
> 
> Yup, and while that fault on the regular file is being handled,
> the rename cannot get past the VFS because the directory
> IOLOCK/i_rwsem is held...
> 
> > So with that in mind, can I get a re-review of the original patch?  I'll
> > add the above to the commit message as a justification for why we can't
> > just move the ilock/iunlock calls.
> 
> On the above, I think it's a false positive. Can you go through the
> analysis again and check that I haven't missed a case where the VFS
> allows concurrent read and write access to a single directory?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2021-12-16  4:56   ` Dave Chinner
  2021-12-17 18:59     ` Darrick J. Wong
@ 2022-01-05  0:09     ` Dave Chinner
  2022-01-05  1:38       ` Darrick J. Wong
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-01-05  0:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > I was poking around in the directory code while diagnosing online fsck
> > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > the data fork mappings
> 
> Yup, that is pretty much guaranteed. If the inode is in extent or
> btree form, the extent count will be non-zero, hence we can only
> get to the xfs_dir2_isblock() check if the inode has moved from
> local to block form between the open and the xfs_dir2_isblock()
> call in the getdents code.
> 
> > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > we're protected against writer threads, but we really need to follow the
> > locking model like we do in other places.  The same applies to the
> > shortform getdents function.
> 
> Locking rules should be the same as xfs_dir_lookup().....

Ok, I assumed the locking in xfs_dir_lookup() is optimal....

.... which it turns out it isn't. All calls to xfs_dir_lookup()
occur with the directory locked at the VFS level, so the internal
contents of the directory can never change during a lookup. Hence
holding the ILOCK across the entire lookup is both unnecessary and
excessive.

What xfs_dir_lookup() should be doing is what xfs_readdir() is
largely already doing - just locking the ILOCK around buffer read
operations when we are mapping directory offsets to physical disk
locations and reading them from disk.  Changing this is a
significant set of changes, so it's not something that needs to be
done right now.

However, we still need to protect the xfs_dir2_isblock() lookup call
in xfs_readdir().

> > While we're at it, clean up the somewhat strange structure of this
> > function.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
> >  1 file changed, 17 insertions(+), 11 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> > index 8310005af00f..25560151c273 100644
> > --- a/fs/xfs/xfs_dir2_readdir.c
> > +++ b/fs/xfs/xfs_dir2_readdir.c
> > @@ -507,8 +507,9 @@ xfs_readdir(
> >  	size_t			bufsize)
> >  {
> >  	struct xfs_da_args	args = { NULL };
> > -	int			rval;
> > -	int			v;
> > +	unsigned int		lock_mode;
> > +	int			error;
> > +	int			isblock;
> >  
> >  	trace_xfs_readdir(dp);
> >  
> > @@ -522,14 +523,19 @@ xfs_readdir(
> >  	args.geo = dp->i_mount->m_dir_geo;
> >  	args.trans = tp;
> >  
> > -	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > -		rval = xfs_dir2_sf_getdents(&args, ctx);
> > -	else if ((rval = xfs_dir2_isblock(&args, &v)))
> > -		;
> > -	else if (v)
> > -		rval = xfs_dir2_block_getdents(&args, ctx);
> > -	else
> > -		rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> > +	lock_mode = xfs_ilock_data_map_shared(dp);
> > +	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > +		xfs_iunlock(dp, lock_mode);
> > +		return xfs_dir2_sf_getdents(&args, ctx);
> > +	}

Directory inode format cannot change here, so we don't need to
hold the ILOCK at all to do shortform checks.

> >  
> > -	return rval;
> > +	error = xfs_dir2_isblock(&args, &isblock);
> > +	xfs_iunlock(dp, lock_mode);
> > +	if (error)
> > +		return error;
> > +
> > +	if (isblock)
> > +		return xfs_dir2_block_getdents(&args, ctx);

Can the xfs_dir2_isblock() call be moved into
xfs_dir2_block_getdents() where it already takes the ILOCK to read
the block?

> > +
> > +	return xfs_dir2_leaf_getdents(&args, ctx, bufsize);

Otherwise this patch is correct and this is where we should start
fixing the directory locking mess...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/7] xfs: take the ILOCK when accessing the inode core
  2022-01-05  0:09     ` Dave Chinner
@ 2022-01-05  1:38       ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-01-05  1:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jan 05, 2022 at 11:09:47AM +1100, Dave Chinner wrote:
> On Thu, Dec 16, 2021 at 03:56:09PM +1100, Dave Chinner wrote:
> > On Wed, Dec 15, 2021 at 05:09:21PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > I was poking around in the directory code while diagnosing online fsck
> > > bugs, and noticed that xfs_readdir doesn't actually take the directory
> > > ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
> > > the data fork mappings
> > 
> > Yup, that is pretty much guaranteed. If the inode is in extent or
> > btree form, the extent count will be non-zero, hence we can only
> > get to the xfs_dir2_isblock() check if the inode has moved from
> > local to block form between the open and the xfs_dir2_isblock()
> > call in the getdents code.
> > 
> > > and the VFS took i_rwsem (aka IOLOCK_SHARED) so
> > > we're protected against writer threads, but we really need to follow the
> > > locking model like we do in other places.  The same applies to the
> > > shortform getdents function.
> > 
> > Locking rules should be the same as xfs_dir_lookup().....
> 
> Ok, I assumed the locking in xfs_dir_lookup() is optimal....
> 
> .... which it turns out it isn't. All calls to xfs_dir_lookup()
> occur with the directory locked at the VFS level, so the internal
> contents of the directory can never change during a lookup. Hence
> holding the ILOCK across the entire lookup is both unnecessary and
> excessive.
> 
> What xfs_dir_lookup() should be doing is what xfs_readdir() is
> largely already doing - just locking the ILOCK around buffer read
> operations when we are mapping directory offsets to physical disk
> locations and reading them from disk.  Changing this is a
> significant set of changes, so it's not something that needs to be
> done right now.
> 
> However, we still need to protect the xfs_dir2_isblock() lookup call
> in xfs_readdir().
> 
> > > While we're at it, clean up the somewhat strange structure of this
> > > function.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_dir2_readdir.c |   28 +++++++++++++++++-----------
> > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> > > index 8310005af00f..25560151c273 100644
> > > --- a/fs/xfs/xfs_dir2_readdir.c
> > > +++ b/fs/xfs/xfs_dir2_readdir.c
> > > @@ -507,8 +507,9 @@ xfs_readdir(
> > >  	size_t			bufsize)
> > >  {
> > >  	struct xfs_da_args	args = { NULL };
> > > -	int			rval;
> > > -	int			v;
> > > +	unsigned int		lock_mode;
> > > +	int			error;
> > > +	int			isblock;
> > >  
> > >  	trace_xfs_readdir(dp);
> > >  
> > > @@ -522,14 +523,19 @@ xfs_readdir(
> > >  	args.geo = dp->i_mount->m_dir_geo;
> > >  	args.trans = tp;
> > >  
> > > -	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > > -		rval = xfs_dir2_sf_getdents(&args, ctx);
> > > -	else if ((rval = xfs_dir2_isblock(&args, &v)))
> > > -		;
> > > -	else if (v)
> > > -		rval = xfs_dir2_block_getdents(&args, ctx);
> > > -	else
> > > -		rval = xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> > > +	lock_mode = xfs_ilock_data_map_shared(dp);
> > > +	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > +		xfs_iunlock(dp, lock_mode);
> > > +		return xfs_dir2_sf_getdents(&args, ctx);
> > > +	}
> 
> Directory inode format cannot change here, so we don't need to
> hold the ILOCK at all to do shortform checks.

Ok.

> > >  
> > > -	return rval;
> > > +	error = xfs_dir2_isblock(&args, &isblock);
> > > +	xfs_iunlock(dp, lock_mode);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	if (isblock)
> > > +		return xfs_dir2_block_getdents(&args, ctx);
> 
> Can the xfs_dir2_isblock() call be moved into
> xfs_dir2_block_getdents() where it already takes the ILOCK to read
> the block?

Yeah.

> > > +
> > > +	return xfs_dir2_leaf_getdents(&args, ctx, bufsize);
> 
> Otherwise this patch is correct and this is where we should start
> fixing the directory locking mess...

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


end of thread, other threads:[~2022-01-05  1:38 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-16  1:09 [PATCHSET 0/7] xfs: random fixes for 5.17 Darrick J. Wong
2021-12-16  1:09 ` [PATCH 1/7] xfs: take the ILOCK when accessing the inode core Darrick J. Wong
2021-12-16  4:56   ` Dave Chinner
2021-12-17 18:59     ` Darrick J. Wong
2021-12-21  1:08       ` Darrick J. Wong
2021-12-21  5:10         ` Dave Chinner
2022-01-04  1:19           ` Darrick J. Wong
2022-01-05  0:09     ` Dave Chinner
2022-01-05  1:38       ` Darrick J. Wong
2021-12-16  1:09 ` [PATCH 2/7] xfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
2021-12-16  4:57   ` Dave Chinner
2021-12-24  7:16   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 3/7] xfs: fix a bug in the online fsck directory leaf1 bestcount check Darrick J. Wong
2021-12-16  5:05   ` Dave Chinner
2021-12-16 19:25     ` Darrick J. Wong
2021-12-16 21:17       ` Dave Chinner
2021-12-16 21:40         ` Darrick J. Wong
2021-12-16 22:04   ` Dave Chinner
2021-12-24  7:17   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 4/7] xfs: prevent UAF in xfs_log_item_in_current_chkpt Darrick J. Wong
2021-12-16  4:36   ` Dave Chinner
2021-12-16 16:35     ` Darrick J. Wong
2021-12-16  1:09 ` [PATCH 5/7] xfs: fix quotaoff mutex usage now that we don't support disabling it Darrick J. Wong
2021-12-16  5:07   ` Dave Chinner
2021-12-24  7:17   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 6/7] xfs: don't expose internal symlink metadata buffers to the vfs Darrick J. Wong
2021-12-16  5:11   ` Dave Chinner
2021-12-17  2:58     ` Ian Kent
2021-12-24  7:22   ` Christoph Hellwig
2021-12-16  1:09 ` [PATCH 7/7] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
2021-12-16  4:41   ` Dave Chinner
2021-12-24  7:18   ` Christoph Hellwig
