All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] xfs: fix mount vs shrinker race
@ 2018-03-20  5:00 Dave Chinner
  2018-03-20  5:00 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner
  2018-03-20  5:00 ` [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2018-03-20  5:00 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

These two patches exercise and fix a mount vs shrinker race that can
result in an oops. The first patch adds a delay after sufficient
superblock state has been set up to allow the VFS superblock
shrinker to run, but not enough XFS state for the shrinker to run
without crashing. The second patch fixes the problem.

Thoughts, comments?

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] xfs: add mount delay debug option
  2018-03-20  5:00 [PATCH 0/2] xfs: fix mount vs shrinker race Dave Chinner
@ 2018-03-20  5:00 ` Dave Chinner
  2018-03-20 12:00   ` Brian Foster
  2018-03-20  5:00 ` [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-03-20  5:00 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to log_recovery_delay, this delay occurs between the VFS
superblock being initialised and the xfs_mount being fully
initialised. It also poisons the per-ag radix tree node so that it
can be used for triggering shrinker races during mount
such as the following:

<run memory pressure workload in background>

$ cat dirty-mount.sh
#! /bin/bash

umount -f /dev/pmem0
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
rm -f /mnt/test/foo
xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
umount /dev/pmem0

# let's crash it now!
echo 30 > /sys/fs/xfs/debug/mount_delay
mount /dev/pmem0 /mnt/test
echo 0 > /sys/fs/xfs/debug/mount_delay
umount /dev/pmem0
$ sudo ./dirty-mount.sh
.....
[   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
[   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
[   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
[   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
[   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
[   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
[   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
[   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
[   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
[   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
[   60.394519] Call Trace:
[   60.395252]  radix_tree_gang_lookup_tag+0xc4/0x130
[   60.395948]  xfs_perag_get_tag+0x37/0xf0
[   60.396522]  xfs_reclaim_inodes_count+0x32/0x40
[   60.397178]  xfs_fs_nr_cached_objects+0x11/0x20
[   60.397837]  super_cache_count+0x35/0xc0
[   60.399159]  shrink_slab.part.66+0xb1/0x370
[   60.400194]  shrink_node+0x7e/0x1a0
[   60.401058]  try_to_free_pages+0x199/0x470
[   60.402081]  __alloc_pages_slowpath+0x3a1/0xd20
[   60.403729]  __alloc_pages_nodemask+0x1c3/0x200
[   60.404941]  cache_grow_begin+0x20b/0x2e0
[   60.406164]  fallback_alloc+0x160/0x200
[   60.407088]  kmem_cache_alloc+0x111/0x4e0
[   60.408038]  ? xfs_buf_rele+0x61/0x430
[   60.408925]  kmem_zone_alloc+0x61/0xe0
[   60.409965]  xfs_inode_alloc+0x24/0x1d0
.....


Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_globals.c |  1 +
 fs/xfs/xfs_super.c   | 12 ++++++++++++
 fs/xfs/xfs_sysctl.h  |  1 +
 fs/xfs/xfs_sysfs.c   | 31 +++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+)

diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index 3e1cc3001bcb..fdde17a2333c 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -47,6 +47,7 @@ xfs_param_t xfs_params = {
 
 struct xfs_globals xfs_globals = {
 	.log_recovery_delay	=	0,	/* no delay by default */
+	.mount_delay		=	0,	/* no delay by default */
 #ifdef XFS_ASSERT_FATAL
 	.bug_on_assert		=	true,	/* assert failures BUG() */
 #else
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4944fe37fc95..ee26437dc567 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1617,6 +1617,18 @@ xfs_fs_fill_super(
 #endif
 	sb->s_op = &xfs_super_operations;
 
+	/*
+	 * Delay mount work if the debug hook is set. This is debug
+	 * instrumention to coordinate simulation of xfs mount failures with
+	 * VFS superblock operations
+	 */
+	if (xfs_globals.mount_delay) {
+		memset(&mp->m_perag_tree, 0xa5, sizeof(mp->m_perag_tree));
+		xfs_notice(mp, "Delaying mount for %d seconds.",
+			xfs_globals.mount_delay);
+		msleep(xfs_globals.mount_delay * 1000);
+	}
+
 	if (silent)
 		flags |= XFS_MFSI_QUIET;
 
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 82afee005140..b53a33e69932 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -95,6 +95,7 @@ extern xfs_param_t	xfs_params;
 
 struct xfs_globals {
 	int	log_recovery_delay;	/* log recovery delay (secs) */
+	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
 };
 extern struct xfs_globals	xfs_globals;
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 8b2ccc234f36..2d5cd2529f8e 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -165,9 +165,40 @@ log_recovery_delay_show(
 }
 XFS_SYSFS_ATTR_RW(log_recovery_delay);
 
+STATIC ssize_t
+mount_delay_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 0 || val > 60)
+		return -EINVAL;
+
+	xfs_globals.mount_delay = val;
+
+	return count;
+}
+
+STATIC ssize_t
+mount_delay_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.mount_delay);
+}
+XFS_SYSFS_ATTR_RW(mount_delay);
+
 static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(bug_on_assert),
 	ATTR_LIST(log_recovery_delay),
+	ATTR_LIST(mount_delay),
 	NULL,
 };
 
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20  5:00 [PATCH 0/2] xfs: fix mount vs shrinker race Dave Chinner
  2018-03-20  5:00 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner
@ 2018-03-20  5:00 ` Dave Chinner
  2018-03-20 12:08   ` Brian Foster
  2018-03-21 17:25   ` Darrick J. Wong
  1 sibling, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2018-03-20  5:00 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We recently came across an oops on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.

We found a mount in a failed state, blocked on teh shrinker rwsem
here:

mount_bdev()
  deactivate_locked_super()
    unregister_shrinker()

Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.

Setting sb->s_fs_info to NULL in this case won't solve the problem;
we'll still crash on a null pointer in xfs_reclaim_inodes_count().
The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does it's setup work.

The twist here is that we need the sb shrinker to run in the XFS
mount process because of the memory pressure that log recovery and
quota check can create. Hence we need to prevent the indoe cache
scanning from occurring until after we've set up the per-ag
structures and caches that it depends on. We also need to ensure
that we clear the flag and sb->s_fs_info on failure so that we
can do the right thing in all cases in xfs_reclaim_inodes_count().

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.c | 2 ++
 fs/xfs/xfs_mount.h | 1 +
 fs/xfs/xfs_super.c | 9 +++++++++
 3 files changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 6f5a5e6764d8..35a263c90fc0 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -155,6 +155,7 @@ xfs_free_perag(
 	xfs_agnumber_t	agno;
 	struct xfs_perag *pag;
 
+	mp->m_flags &= ~XFS_MOUNT_PERAG_DONE;
 	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
 		spin_lock(&mp->m_perag_lock);
 		pag = radix_tree_delete(&mp->m_perag_tree, agno);
@@ -244,6 +245,7 @@ xfs_initialize_perag(
 		*maxagi = index;
 
 	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
+	mp->m_flags |= XFS_MOUNT_PERAG_DONE;
 	return 0;
 
 out_hash_destroy:
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 1808f56decaa..3cf7309b615d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -239,6 +239,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_FILESTREAMS	(1ULL << 24)	/* enable the filestreams
 						   allocator */
 #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
+#define XFS_MOUNT_PERAG_DONE	(1ULL << 26)	/* perag icache init done */
 
 #define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ee26437dc567..d3283e8c2ce2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1755,6 +1755,7 @@ xfs_fs_fill_super(
  out_close_devices:
 	xfs_close_devices(mp);
  out_free_fsname:
+	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
 	kfree(mp);
  out:
@@ -1800,6 +1801,14 @@ xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
+	/*
+	 * Don't do anything until the filesystem is fully set up, or in the
+	 * process of being torn down due to a mount failure.
+	 */
+	if (!sb->s_fs_info)
+		return 0;
+	if (!(XFS_M(sb)->m_flags & XFS_MOUNT_PERAG_DONE))
+		return 0;
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
 
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] xfs: add mount delay debug option
  2018-03-20  5:00 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner
@ 2018-03-20 12:00   ` Brian Foster
  2018-03-20 13:12     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2018-03-20 12:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 20, 2018 at 04:00:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to log_recovery_delay, this delay occurs between the VFS
> superblock being initialised and the xfs_mount being fully
> initialised. It also poisons the per-ag radix tree node so that it
> can be used for triggering shrinker races during mount
> such as the following:
> 
> <run memory pressure workload in background>
> 
> $ cat dirty-mount.sh
> #! /bin/bash
> 
> umount -f /dev/pmem0
> mkfs.xfs -f /dev/pmem0
> mount /dev/pmem0 /mnt/test
> rm -f /mnt/test/foo
> xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
> umount /dev/pmem0
> 
> # let's crash it now!
> echo 30 > /sys/fs/xfs/debug/mount_delay
> mount /dev/pmem0 /mnt/test
> echo 0 > /sys/fs/xfs/debug/mount_delay
> umount /dev/pmem0
> $ sudo ./dirty-mount.sh
> .....

Planning to post a test for this?

> [   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
> [   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
> [   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
> [   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
> [   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
> [   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
> [   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
> [   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
> [   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
> [   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0

Was the beginning of this error splat snipped out? It might be useful to
include that and perhaps instead snip out some of the specific register
context above. Otherwise looks fine:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> [   60.394519] Call Trace:
> [   60.395252]  radix_tree_gang_lookup_tag+0xc4/0x130
> [   60.395948]  xfs_perag_get_tag+0x37/0xf0
> [   60.396522]  xfs_reclaim_inodes_count+0x32/0x40
> [   60.397178]  xfs_fs_nr_cached_objects+0x11/0x20
> [   60.397837]  super_cache_count+0x35/0xc0
> [   60.399159]  shrink_slab.part.66+0xb1/0x370
> [   60.400194]  shrink_node+0x7e/0x1a0
> [   60.401058]  try_to_free_pages+0x199/0x470
> [   60.402081]  __alloc_pages_slowpath+0x3a1/0xd20
> [   60.403729]  __alloc_pages_nodemask+0x1c3/0x200
> [   60.404941]  cache_grow_begin+0x20b/0x2e0
> [   60.406164]  fallback_alloc+0x160/0x200
> [   60.407088]  kmem_cache_alloc+0x111/0x4e0
> [   60.408038]  ? xfs_buf_rele+0x61/0x430
> [   60.408925]  kmem_zone_alloc+0x61/0xe0
> [   60.409965]  xfs_inode_alloc+0x24/0x1d0
> .....
> 
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_globals.c |  1 +
>  fs/xfs/xfs_super.c   | 12 ++++++++++++
>  fs/xfs/xfs_sysctl.h  |  1 +
>  fs/xfs/xfs_sysfs.c   | 31 +++++++++++++++++++++++++++++++
>  4 files changed, 45 insertions(+)
> 
> diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
> index 3e1cc3001bcb..fdde17a2333c 100644
> --- a/fs/xfs/xfs_globals.c
> +++ b/fs/xfs/xfs_globals.c
> @@ -47,6 +47,7 @@ xfs_param_t xfs_params = {
>  
>  struct xfs_globals xfs_globals = {
>  	.log_recovery_delay	=	0,	/* no delay by default */
> +	.mount_delay		=	0,	/* no delay by default */
>  #ifdef XFS_ASSERT_FATAL
>  	.bug_on_assert		=	true,	/* assert failures BUG() */
>  #else
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 4944fe37fc95..ee26437dc567 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1617,6 +1617,18 @@ xfs_fs_fill_super(
>  #endif
>  	sb->s_op = &xfs_super_operations;
>  
> +	/*
> +	 * Delay mount work if the debug hook is set. This is debug
> +	 * instrumention to coordinate simulation of xfs mount failures with
> +	 * VFS superblock operations
> +	 */
> +	if (xfs_globals.mount_delay) {
> +		memset(&mp->m_perag_tree, 0xa5, sizeof(mp->m_perag_tree));
> +		xfs_notice(mp, "Delaying mount for %d seconds.",
> +			xfs_globals.mount_delay);
> +		msleep(xfs_globals.mount_delay * 1000);
> +	}
> +
>  	if (silent)
>  		flags |= XFS_MFSI_QUIET;
>  
> diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
> index 82afee005140..b53a33e69932 100644
> --- a/fs/xfs/xfs_sysctl.h
> +++ b/fs/xfs/xfs_sysctl.h
> @@ -95,6 +95,7 @@ extern xfs_param_t	xfs_params;
>  
>  struct xfs_globals {
>  	int	log_recovery_delay;	/* log recovery delay (secs) */
> +	int	mount_delay;		/* mount setup delay (secs) */
>  	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
>  };
>  extern struct xfs_globals	xfs_globals;
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index 8b2ccc234f36..2d5cd2529f8e 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -165,9 +165,40 @@ log_recovery_delay_show(
>  }
>  XFS_SYSFS_ATTR_RW(log_recovery_delay);
>  
> +STATIC ssize_t
> +mount_delay_store(
> +	struct kobject	*kobject,
> +	const char	*buf,
> +	size_t		count)
> +{
> +	int		ret;
> +	int		val;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	if (val < 0 || val > 60)
> +		return -EINVAL;
> +
> +	xfs_globals.mount_delay = val;
> +
> +	return count;
> +}
> +
> +STATIC ssize_t
> +mount_delay_show(
> +	struct kobject	*kobject,
> +	char		*buf)
> +{
> +	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.mount_delay);
> +}
> +XFS_SYSFS_ATTR_RW(mount_delay);
> +
>  static struct attribute *xfs_dbg_attrs[] = {
>  	ATTR_LIST(bug_on_assert),
>  	ATTR_LIST(log_recovery_delay),
> +	ATTR_LIST(mount_delay),
>  	NULL,
>  };
>  
> -- 
> 2.16.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20  5:00 ` [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes Dave Chinner
@ 2018-03-20 12:08   ` Brian Foster
  2018-03-20 13:21     ` Dave Chinner
  2018-03-21 17:25   ` Darrick J. Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Brian Foster @ 2018-03-20 12:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 20, 2018 at 04:00:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We recently came across an oops on a 4.14 kernel in
> xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> and so the m_perag_tree lookup walked into lala land.
> 
> We found a mount in a failed state, blocked on teh shrinker rwsem
> here:
> 
> mount_bdev()
>   deactivate_locked_super()
>     unregister_shrinker()
> 
> Essentially, the machine was under memory pressure when the mount
> was being run, xfs_fs_fill_super() failed after allocating the
> xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> freed the xfs_mount, but the sb->s_fs_info field still pointed to
> the freed memory. Hence when the superblock shrinker then ran
> it fell off the bad pointer.
> 

So this sounds like the root cause of the crash is a use after free and
not necessarily a "use before initialized," correct? I see that
->m_perag_tree should be zeroed on allocation, but I have no idea if the
radix tree code is robust enough to deal with that.

> Setting sb->s_fs_info to NULL in this case won't solve the problem;
> we'll still crash on a null pointer in xfs_reclaim_inodes_count().

But the patch adds a NULL check..? I think this could be worded more
clearly.

> The problem is that the superblock shrinker is running before the
> filesystem structures it depends on have been fully set up. i.e.
> the shrinker is registered in sget(), before ->fill_super() has been
> called, and the shrinker can call into the filesystem before
> fill_super() does it's setup work.
> 
> The twist here is that we need the sb shrinker to run in the XFS
> mount process because of the memory pressure that log recovery and
> quota check can create. Hence we need to prevent the indoe cache
> scanning from occurring until after we've set up the per-ag
> structures and caches that it depends on. We also need to ensure
> that we clear the flag and sb->s_fs_info on failure so that we
> can do the right thing in all cases in xfs_reclaim_inodes_count().
> 

Makes sense...

> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.c | 2 ++
>  fs/xfs/xfs_mount.h | 1 +
>  fs/xfs/xfs_super.c | 9 +++++++++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 6f5a5e6764d8..35a263c90fc0 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -155,6 +155,7 @@ xfs_free_perag(
>  	xfs_agnumber_t	agno;
>  	struct xfs_perag *pag;
>  
> +	mp->m_flags &= ~XFS_MOUNT_PERAG_DONE;
>  	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
>  		spin_lock(&mp->m_perag_lock);
>  		pag = radix_tree_delete(&mp->m_perag_tree, agno);
> @@ -244,6 +245,7 @@ xfs_initialize_perag(
>  		*maxagi = index;
>  
>  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> +	mp->m_flags |= XFS_MOUNT_PERAG_DONE;

But why not just move the radix tree root (and lock) initialization to
where the rest of the xfs_mount static structure initializations occur
(or at least the ones we expect to be accessible)? It looks to me that
intentionally occurs before the mp is attached to the superblock and
essentially fixes the problem the flag is trying to fix.

That comment aside..

>  	return 0;
>  
>  out_hash_destroy:
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 1808f56decaa..3cf7309b615d 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -239,6 +239,7 @@ typedef struct xfs_mount {
>  #define XFS_MOUNT_FILESTREAMS	(1ULL << 24)	/* enable the filestreams
>  						   allocator */
>  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
> +#define XFS_MOUNT_PERAG_DONE	(1ULL << 26)	/* perag icache init done */

PERAG_DONE sounds odd to me. XFS_MOUNT_PERAG_INIT perhaps?

>  
>  #define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ee26437dc567..d3283e8c2ce2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1755,6 +1755,7 @@ xfs_fs_fill_super(
>   out_close_devices:
>  	xfs_close_devices(mp);
>   out_free_fsname:
> +	sb->s_fs_info = NULL;
>  	xfs_free_fsname(mp);
>  	kfree(mp);
>   out:
> @@ -1800,6 +1801,14 @@ xfs_fs_nr_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> +	/*
> +	 * Don't do anything until the filesystem is fully set up, or in the

s/filesystem fully set up/perag infrastructure is initialized/ ?

"Fully set up" to me implies the mount is complete, where as noted in
the commit log the shrinker needs to be functional for log recovery and
quotacheck.

> +	 * process of being torn down due to a mount failure.
> +	 */
> +	if (!sb->s_fs_info)
> +		return 0;
> +	if (!(XFS_M(sb)->m_flags & XFS_MOUNT_PERAG_DONE))
> +		return 0;

	struct xfs_mount	*mp = XFS_M(sb);

	if (!mp || !(mp->m_flags & XFS_MOUNT_PERAG_DONE))
		return 0;
	return xfs_reclaim_inodes_count(mp);

Brian

>  	return xfs_reclaim_inodes_count(XFS_M(sb));
>  }
>  
> -- 
> 2.16.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] xfs: add mount delay debug option
  2018-03-20 12:00   ` Brian Foster
@ 2018-03-20 13:12     ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2018-03-20 13:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 20, 2018 at 08:00:24AM -0400, Brian Foster wrote:
> On Tue, Mar 20, 2018 at 04:00:20PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Similar to log_recovery_delay, this delay occurs between the VFS
> > superblock being initialised and the xfs_mount being fully
> > initialised. It also poisons the per-ag radix tree node so that it
> > can be used for triggering shrinker races during mount
> > such as the following:
> > 
> > <run memory pressure workload in background>
> > 
> > $ cat dirty-mount.sh
> > #! /bin/bash
> > 
> > umount -f /dev/pmem0
> > mkfs.xfs -f /dev/pmem0
> > mount /dev/pmem0 /mnt/test
> > rm -f /mnt/test/foo
> > xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo
> > umount /dev/pmem0
> > 
> > # let's crash it now!
> > echo 30 > /sys/fs/xfs/debug/mount_delay
> > mount /dev/pmem0 /mnt/test
> > echo 0 > /sys/fs/xfs/debug/mount_delay
> > umount /dev/pmem0
> > $ sudo ./dirty-mount.sh
> > .....
> 
> Planning to post a test for this?

Haven't written one yet, only really had time to diagnose and write
the fix so far.

> 
> > [   60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G      D W        4.16.0-rc5-dgc #440
> > [   60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > [   60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320
> > [   60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282
> > [   60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a
> > [   60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000
> > [   60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5
> > [   60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598
> > [   60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458
> > [   60.390827] FS:  00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000
> > [   60.392253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0
> 
> Was the beginning of this error splat snipped out? It might be useful to
> include that and perhaps instead snip out some of the specific register
> context above. Otherwise looks fine:

It was one of about 100 threads that smashed into the shrinker at
the same time. It was the most intact trace I could cut and
paste, and the actual oops lines were nowhere to be seen on the
console output....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20 12:08   ` Brian Foster
@ 2018-03-20 13:21     ` Dave Chinner
  2018-03-20 14:21       ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-03-20 13:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 20, 2018 at 08:08:20AM -0400, Brian Foster wrote:
> On Tue, Mar 20, 2018 at 04:00:21PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We recently came across an oops on a 4.14 kernel in
> > xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> > and so the m_perag_tree lookup walked into lala land.
> > 
> > We found a mount in a failed state, blocked on teh shrinker rwsem
> > here:
> > 
> > mount_bdev()
> >   deactivate_locked_super()
> >     unregister_shrinker()
> > 
> > Essentially, the machine was under memory pressure when the mount
> > was being run, xfs_fs_fill_super() failed after allocating the
> > xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> > freed the xfs_mount, but the sb->s_fs_info field still pointed to
> > the freed memory. Hence when the superblock shrinker then ran
> > it fell off the bad pointer.
> > 
> 
> So this sounds like the root cause of the crash is a use after free and
> not necessarily a "use before initialized," correct? I see that
> ->m_perag_tree should be zeroed on allocation, but I have no idea if the
> radix tree code is robust enough to deal with that.

Well, one symptom is use after free. It can also be used before it's
initialised.

> > Setting sb->s_fs_info to NULL in this case won't solve the problem;
> > we'll still crash on a null pointer in xfs_reclaim_inodes_count().
> 
> But the patch adds a NULL check..? I think this could be worded more
> clearly.

Yes, that catches the case would have been a user after free. :)

> > The problem is that the superblock shrinker is running before the
> > filesystem structures it depends on have been fully set up. i.e.
> > the shrinker is registered in sget(), before ->fill_super() has been
> > called, and the shrinker can call into the filesystem before
> > fill_super() does it's setup work.
> > 
> > The twist here is that we need the sb shrinker to run in the XFS
> > mount process because of the memory pressure that log recovery and
> > quota check can create. Hence we need to prevent the indoe cache
> > scanning from occurring until after we've set up the per-ag
> > structures and caches that it depends on. We also need to ensure
> > that we clear the flag and sb->s_fs_info on failure so that we
> > can do the right thing in all cases in xfs_reclaim_inodes_count().
> > 
> 
> Makes sense...
> 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_mount.c | 2 ++
> >  fs/xfs/xfs_mount.h | 1 +
> >  fs/xfs/xfs_super.c | 9 +++++++++
> >  3 files changed, 12 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 6f5a5e6764d8..35a263c90fc0 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -155,6 +155,7 @@ xfs_free_perag(
> >  	xfs_agnumber_t	agno;
> >  	struct xfs_perag *pag;
> >  
> > +	mp->m_flags &= ~XFS_MOUNT_PERAG_DONE;
> >  	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> >  		spin_lock(&mp->m_perag_lock);
> >  		pag = radix_tree_delete(&mp->m_perag_tree, agno);
> > @@ -244,6 +245,7 @@ xfs_initialize_perag(
> >  		*maxagi = index;
> >  
> >  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> > +	mp->m_flags |= XFS_MOUNT_PERAG_DONE;
> 
> But why not just move the radix tree root (and lock) initialization to
> where the rest of the xfs_mount static structure initializations occur
> (or at least the ones we expect to be accessible)? It looks to me that
> intentionally occurs before the mp is attached to the superblock and
> essentially fixes the problem the flag is trying to fix.

Could be done that way, but I still think we've got a general
problem we need to deal with here. i.e. that we can't allow the
shrinker to do anything until we actually initialised the parts of
the xfs_mount that it uses.

I need to look further, because I think we're actually in a state
where the shrinker scan cannot run at all during ->fill_super, which
means we might actually be able to fix this at the VFS level by
checking the SB_ACTIVE flag. I need to look a bit further on that,
and if that's the case all this XFS stuff goes away....

> > +++ b/fs/xfs/xfs_mount.h
> > @@ -239,6 +239,7 @@ typedef struct xfs_mount {
> >  #define XFS_MOUNT_FILESTREAMS	(1ULL << 24)	/* enable the filestreams
> >  						   allocator */
> >  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
> > +#define XFS_MOUNT_PERAG_DONE	(1ULL << 26)	/* perag icache init done */
> 
> PERAG_DONE sounds odd to me. XFS_MOUNT_PERAG_INIT perhaps?

Tried to keep it short. INIT_DONE is what I firs thought, then I
dropped the init bit...

> >  {
> > +	/*
> > +	 * Don't do anything until the filesystem is fully set up, or in the
> 
> s/filesystem fully set up/perag infrastructure is initialized/ ?
> 
> "Fully set up" to me implies the mount is complete, where as noted in
> the commit log the shrinker needs to be functional for log recovery and
> quotacheck.

Well, that used to be true (th equotacheck thing) but as I mentioned
above, I'm not sure the shrinker runs at all during quotacheck...

> > +	 * process of being torn down due to a mount failure.
> > +	 */
> > +	if (!sb->s_fs_info)
> > +		return 0;
> > +	if (!(XFS_M(sb)->m_flags & XFS_MOUNT_PERAG_DONE))
> > +		return 0;
> 
> 	struct xfs_mount	*mp = XFS_M(sb);
> 
> 	if (!mp || !(mp->m_flags & XFS_MOUNT_PERAG_DONE))
> 		return 0;

I'd much prefer an explicit check of s_fs_info here, because
then it's obvious exactly what we expect to be null. XFS_M() could
ahve more code in it than just a cast of sb->s_fs_info. BUt, like I
said, this code might just go away....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20 13:21     ` Dave Chinner
@ 2018-03-20 14:21       ` Brian Foster
  2018-03-20 22:33         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2018-03-20 14:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 21, 2018 at 12:21:14AM +1100, Dave Chinner wrote:
> On Tue, Mar 20, 2018 at 08:08:20AM -0400, Brian Foster wrote:
> > On Tue, Mar 20, 2018 at 04:00:21PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > We recently came across an oops on a 4.14 kernel in
> > > xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> > > and so the m_perag_tree lookup walked into lala land.
> > > 
> > > We found a mount in a failed state, blocked on teh shrinker rwsem
> > > here:
> > > 
> > > mount_bdev()
> > >   deactivate_locked_super()
> > >     unregister_shrinker()
> > > 
> > > Essentially, the machine was under memory pressure when the mount
> > > was being run, xfs_fs_fill_super() failed after allocating the
> > > xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> > > freed the xfs_mount, but the sb->s_fs_info field still pointed to
> > > the freed memory. Hence when the superblock shrinker then ran
> > > it fell off the bad pointer.
> > > 
> > 
> > So this sounds like the root cause of the crash is a use after free and
> > not necessarily a "use before initialized," correct? I see that
> > ->m_perag_tree should be zeroed on allocation, but I have no idea if the
> > radix tree code is robust enough to deal with that.
> 
> Well, one symptom is use after free. It can also be used before it's
> initialised.
> 

Right.. but the user reproduced one was the use-after-free variant..?

> > > Setting sb->s_fs_info to NULL in this case won't solve the problem;
> > > we'll still crash on a null pointer in xfs_reclaim_inodes_count().
> > 
> > But the patch adds a NULL check..? I think this could be worded more
> > clearly.
> 
> Yes, that catches the case would have been a user after free. :)
> 

So what you're saying is that setting s_fs_info to NULL at least partly
solves the problem. ;P (But anyways, just wording confusion).

> > > The problem is that the superblock shrinker is running before the
> > > filesystem structures it depends on have been fully set up. i.e.
> > > the shrinker is registered in sget(), before ->fill_super() has been
> > > called, and the shrinker can call into the filesystem before
> > > fill_super() does it's setup work.
> > > 
> > > The twist here is that we need the sb shrinker to run in the XFS
> > > mount process because of the memory pressure that log recovery and
> > > quota check can create. Hence we need to prevent the indoe cache
> > > scanning from occurring until after we've set up the per-ag
> > > structures and caches that it depends on. We also need to ensure
> > > that we clear the flag and sb->s_fs_info on failure so that we
> > > can do the right thing in all cases in xfs_reclaim_inodes_count().
> > > 
> > 
> > Makes sense...
> > 
> > > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_mount.c | 2 ++
> > >  fs/xfs/xfs_mount.h | 1 +
> > >  fs/xfs/xfs_super.c | 9 +++++++++
> > >  3 files changed, 12 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index 6f5a5e6764d8..35a263c90fc0 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -155,6 +155,7 @@ xfs_free_perag(
> > >  	xfs_agnumber_t	agno;
> > >  	struct xfs_perag *pag;
> > >  
> > > +	mp->m_flags &= ~XFS_MOUNT_PERAG_DONE;
> > >  	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > >  		spin_lock(&mp->m_perag_lock);
> > >  		pag = radix_tree_delete(&mp->m_perag_tree, agno);
> > > @@ -244,6 +245,7 @@ xfs_initialize_perag(
> > >  		*maxagi = index;
> > >  
> > >  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> > > +	mp->m_flags |= XFS_MOUNT_PERAG_DONE;
> > 
> > But why not just move the radix tree root (and lock) initialization to
> > where the rest of the xfs_mount static structure initializations occur
> > (or at least the ones we expect to be accessible)? It looks to me that
> > intentionally occurs before the mp is attached to the superblock and
> > essentially fixes the problem the flag is trying to fix.
> 
> Could be done that way, but I still think we've got a general
> problem we need to deal with here. i.e. that we can't allow the
> shrinker to do anything until we actually initialised the parts of
> the xfs_mount that it uses.
> 

I actually think it might be useful to initialize the static/low-level
data structures (i.e., locks, radix bits, etc.) as such regardless.
There's only a couple places where we don't do that already (the perag
radix tree root being one), and that doesn't preclude higher level XFS
state management such as is being done by the XFS_MOUNT_PERAG_DONE flag,
if necessary. See the appended diff for reference.

I'm also wondering if there's value in doing a memset() (as your
previous patch) over the entire xfs_mount in DEBUG mode, immediately
after it's allocated, to facilitate catching this type of thing in the
future. That would require more of an audit to verify that we don't rely
on the kzalloc to correctly initialize certain fields (bools, ints,
etc.) though..

> I need to look further, because I think we're actually in a state
> where the shrinker scan cannot run at all during ->fill_super, which
> means we might actually be able to fix this at the VFS level by
> checking the SB_ACTIVE flag. I need to look a bit further on that,
> and if that's the case all this XFS stuff goes away....
> 

I'm pretty sure the dquot shrinker can run during quotacheck because we
had to make the previous delwri queue handling fixes to deal with that.
Though that shrinker is registered dynamically and so may have nothing
to do with the superblock inode shrinker for all I'm aware. Just a note,
FWIW.

Brian

--- 8< ---

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a55f7a45fa78..53433cc024fd 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -731,7 +731,6 @@ xfs_sb_mount_common(
 	struct xfs_sb	*sbp)
 {
 	mp->m_agfrotor = mp->m_agirotor = 0;
-	spin_lock_init(&mp->m_agirotor_lock);
 	mp->m_maxagi = mp->m_sb.sb_agcount;
 	mp->m_blkbit_log = sbp->sb_blocklog + XFS_NBBYLOG;
 	mp->m_blkbb_log = sbp->sb_blocklog - BBSHIFT;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 6f5a5e6764d8..a901b86772f8 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -817,8 +817,6 @@ xfs_mountfs(
 	/*
 	 * Allocate and initialize the per-ag data.
 	 */
-	spin_lock_init(&mp->m_perag_lock);
-	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	error = xfs_initialize_perag(mp, sbp->sb_agcount, &mp->m_maxagi);
 	if (error) {
 		xfs_warn(mp, "Failed per-ag init: %d", error);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 951271f57d00..3dd9544be9ec 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1579,29 +1579,44 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_fdblocks);
 }
 
-STATIC int
-xfs_fs_fill_super(
-	struct super_block	*sb,
-	void			*data,
-	int			silent)
+static struct xfs_mount *
+xfs_mount_alloc(
+	struct super_block	*sb)
 {
-	struct inode		*root;
-	struct xfs_mount	*mp = NULL;
-	int			flags = 0, error = -ENOMEM;
+	struct xfs_mount	*mp;
 
 	mp = kzalloc(sizeof(struct xfs_mount), GFP_KERNEL);
 	if (!mp)
-		goto out;
+		return NULL;
 
+	mp->m_super = sb;
 	spin_lock_init(&mp->m_sb_lock);
+	spin_lock_init(&mp->m_agirotor_lock);
+	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
+	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
 	atomic_set(&mp->m_active_trans, 0);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
 	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
+	return mp;
+}
 
-	mp->m_super = sb;
+
+STATIC int
+xfs_fs_fill_super(
+	struct super_block	*sb,
+	void			*data,
+	int			silent)
+{
+	struct inode		*root;
+	struct xfs_mount	*mp = NULL;
+	int			flags = 0, error = -ENOMEM;
+
+	mp = xfs_mount_alloc(sb);
+	if (!mp)
+		goto out;
 	sb->s_fs_info = mp;
 
 	error = xfs_parseargs(mp, (char *)data);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20 14:21       ` Brian Foster
@ 2018-03-20 22:33         ` Dave Chinner
  2018-03-21 11:26           ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-03-20 22:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Mar 20, 2018 at 10:21:40AM -0400, Brian Foster wrote:
> On Wed, Mar 21, 2018 at 12:21:14AM +1100, Dave Chinner wrote:
> > On Tue, Mar 20, 2018 at 08:08:20AM -0400, Brian Foster wrote:
> > > On Tue, Mar 20, 2018 at 04:00:21PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > We recently came across an oops on a 4.14 kernel in
> > > > xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> > > > and so the m_perag_tree lookup walked into lala land.
> > > > 
> > > > We found a mount in a failed state, blocked on teh shrinker rwsem
> > > > here:
> > > > 
> > > > mount_bdev()
> > > >   deactivate_locked_super()
> > > >     unregister_shrinker()
> > > > 
> > > > Essentially, the machine was under memory pressure when the mount
> > > > was being run, xfs_fs_fill_super() failed after allocating the
> > > > xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> > > > freed the xfs_mount, but the sb->s_fs_info field still pointed to
> > > > the freed memory. Hence when the superblock shrinker then ran
> > > > it fell off the bad pointer.
> > > > 
> > > 
> > > So this sounds like the root cause of the crash is a use after free and
> > > not necessarily a "use before initialized," correct? I see that
> > > ->m_perag_tree should be zeroed on allocation, but I have no idea if the
> > > radix tree code is robust enough to deal with that.
> > 
> > Well, one symptom is use after free. It can also be used before it's
> > initialised.
> > 
> 
> Right.. but the user reproduced one was the use-after-free variant..?

Yeah, they are just different timings on the same issue. Basically,
If I were to fail xfs_fs_fill_super() after xfs_mountfs() completed
successfully and then delay in deactivate_locked_super() it would
reproduce the use after free rather than the
use-before-initialisation.

> > > > Setting sb->s_fs_info to NULL in this case won't solve the problem;
> > > > we'll still crash on a null pointer in xfs_reclaim_inodes_count().
> > > 
> > > But the patch adds a NULL check..? I think this could be worded more
> > > clearly.
> > 
> > Yes, that catches the case would have been a user after free. :)
> > 
> 
> So what you're saying is that setting s_fs_info to NULL at least partly
> solves the problem. ;P (But anyways, just wording confusion).

Yeah, just wording - "doesn't solve the whole problem" :)

> > > > The problem is that the superblock shrinker is running before the
> > > >  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> > > > +	mp->m_flags |= XFS_MOUNT_PERAG_DONE;
> > > 
> > > But why not just move the radix tree root (and lock) initialization to
> > > where the rest of the xfs_mount static structure initializations occur
> > > (or at least the ones we expect to be accessible)? It looks to me that
> > > intentionally occurs before the mp is attached to the superblock and
> > > essentially fixes the problem the flag is trying to fix.
> > 
> > Could be done that way, but I still think we've got a general
> > problem we need to deal with here. i.e. that we can't allow the
> > shrinker to do anything until we actually initialised the parts of
> > the xfs_mount that it uses.
> 
> I actually think it might be useful to initialize the static/low-level
> data structures (i.e., locks, radix bits, etc.) as such regardless.

Well, the radix tree is already initialised - it's completely zero,
so a lookup on a zeroed tree root should return NULL because the
root node is null. That's the purpose of the memset before the delay
- to ensure that a lookup executed by the shrinker actually dies a
horrible death (which it does!)

> There's only a couple places where we don't do that already (the perag
> radix tree root being one), and that doesn't preclude higher level XFS
> state management such as is being done by the XFS_MOUNT_PERAG_DONE flag,
> if necessary. See the appended diff for reference.
> 
> I'm also wondering if there's value in doing a memset() (as your
> previous patch) over the entire xfs_mount in DEBUG mode, immediately
> after it's allocated, to facilitate catching this type of thing in the
> future. That would require more of an audit to verify that we don't rely
> on the kzalloc to correctly initialize certain fields (bools, ints,
> etc.) though..

Yeah, the zeroing on allocation is done to ensure we don't have to
zero each element individually. I'd much prefer we leave the code
like that rather than introduce a whole new bug vector by having to
manually initialise fields to zero...

> > I need to look further, because I think we're actually in a state
> > where the shrinker scan cannot run at all during ->fill_super, which
> > means we might actually be able to fix this at the VFS level by
> > checking the SB_ACTIVE flag. I need to look a bit further on that,
> > and if that's the case all this XFS stuff goes away....
> > 
> 
> I'm pretty sure the dquot shrinker can run during quotacheck because we
> had to make the previous delwri queue handling fixes to deal with that.
> Though that shrinker is registered dynamically and so may have nothing
> to do with the superblock inode shrinker for all I'm aware. Just a note,
> FWIW.

We control the dquot shrinker ourselves, so we don't start it until
it's safe to have it run. This problem is purely one of the VFS
superblock shrinker where we don't directly control it's startup or
access to the filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20 22:33         ` Dave Chinner
@ 2018-03-21 11:26           ` Brian Foster
  0 siblings, 0 replies; 11+ messages in thread
From: Brian Foster @ 2018-03-21 11:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 21, 2018 at 09:33:07AM +1100, Dave Chinner wrote:
> On Tue, Mar 20, 2018 at 10:21:40AM -0400, Brian Foster wrote:
> > On Wed, Mar 21, 2018 at 12:21:14AM +1100, Dave Chinner wrote:
> > > On Tue, Mar 20, 2018 at 08:08:20AM -0400, Brian Foster wrote:
> > > > On Tue, Mar 20, 2018 at 04:00:21PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > We recently came across an oops on a 4.14 kernel in
> > > > > xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> > > > > and so the m_perag_tree lookup walked into lala land.
> > > > > 
> > > > > We found a mount in a failed state, blocked on teh shrinker rwsem
> > > > > here:
> > > > > 
> > > > > mount_bdev()
> > > > >   deactivate_locked_super()
> > > > >     unregister_shrinker()
> > > > > 
> > > > > Essentially, the machine was under memory pressure when the mount
> > > > > was being run, xfs_fs_fill_super() failed after allocating the
> > > > > xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> > > > > freed the xfs_mount, but the sb->s_fs_info field still pointed to
> > > > > the freed memory. Hence when the superblock shrinker then ran
> > > > > it fell off the bad pointer.
> > > > > 
> > > > 
> > > > So this sounds like the root cause of the crash is a use after free and
> > > > not necessarily a "use before initialized," correct? I see that
> > > > ->m_perag_tree should be zeroed on allocation, but I have no idea if the
> > > > radix tree code is robust enough to deal with that.
> > > 
> > > Well, one symptom is use after free. It can also be used before it's
> > > initialised.
> > > 
> > 
> > Right.. but the user reproduced one was the use-after-free variant..?
> 
> Yeah, they are just different timings on the same issue. Basically,
> If I were to fail xfs_fs_fill_super() after xfs_mountfs() completed
> successfully and then delay in deactivate_locked_super() it would
> reproduce the use after free rather than the
> use-before-initialisation.
> 
> > > > > Setting sb->s_fs_info to NULL in this case won't solve the problem;
> > > > > we'll still crash on a null pointer in xfs_reclaim_inodes_count().
> > > > 
> > > > But the patch adds a NULL check..? I think this could be worded more
> > > > clearly.
> > > 
> > > Yes, that catches the case would have been a user after free. :)
> > > 
> > 
> > So what you're saying is that setting s_fs_info to NULL at least partly
> > solves the problem. ;P (But anyways, just wording confusion).
> 
> Yeah, just wording - "doesn't solve the whole problem" :)
> 
> > > > > The problem is that the superblock shrinker is running before the
> > > > >  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> > > > > +	mp->m_flags |= XFS_MOUNT_PERAG_DONE;
> > > > 
> > > > But why not just move the radix tree root (and lock) initialization to
> > > > where the rest of the xfs_mount static structure initializations occur
> > > > (or at least the ones we expect to be accessible)? It looks to me that
> > > > intentionally occurs before the mp is attached to the superblock and
> > > > essentially fixes the problem the flag is trying to fix.
> > > 
> > > Could be done that way, but I still think we've got a general
> > > problem we need to deal with here. i.e. that we can't allow the
> > > shrinker to do anything until we actually initialised the parts of
> > > the xfs_mount that it uses.
> > 
> > I actually think it might be useful to initialize the static/low-level
> > data structures (i.e., locks, radix bits, etc.) as such regardless.
> 
> Well, the radix tree is already initialised - it's completely zero,
> so a lookup on a zeroed tree root should return NULL because the
> root node is null. That's the purpose of the memset before the delay
> - to ensure that a lookup executed by the shrinker actually dies a
> horrible death (which it does!)
> 

We still have to call the tree init function (and other similar
initializers) one way or another. We already do much of that immediately
after allocation.

> > There's only a couple places where we don't do that already (the perag
> > radix tree root being one), and that doesn't preclude higher level XFS
> > state management such as is being done by the XFS_MOUNT_PERAG_DONE flag,
> > if necessary. See the appended diff for reference.
> > 
> > I'm also wondering if there's value in doing a memset() (as your
> > previous patch) over the entire xfs_mount in DEBUG mode, immediately
> > after it's allocated, to facilitate catching this type of thing in the
> > future. That would require more of an audit to verify that we don't rely
> > on the kzalloc to correctly initialize certain fields (bools, ints,
> > etc.) though..
> 
> Yeah, the zeroing on allocation is done to ensure we don't have to
> zero each element individually. I'd much prefer we leave the code
> like that rather than introduce a whole new bug vector by having to
> manually initialise fields to zero...
> 

I am a little curious if we rely on that in practice (i.e., are there
any zero/false fields that are not explicitly set by the time
xfs_mountfs() returns?), but Ok, fair enough.

Brian

> > > I need to look further, because I think we're actually in a state
> > > where the shrinker scan cannot run at all during ->fill_super, which
> > > means we might actually be able to fix this at the VFS level by
> > > checking the SB_ACTIVE flag. I need to look a bit further on that,
> > > and if that's the case all this XFS stuff goes away....
> > > 
> > 
> > I'm pretty sure the dquot shrinker can run during quotacheck because we
> > had to make the previous delwri queue handling fixes to deal with that.
> > Though that shrinker is registered dynamically and so may have nothing
> > to do with the superblock inode shrinker for all I'm aware. Just a note,
> > FWIW.
> 
> We control the dquot shrinker ourselves, so we don't start it until
> it's safe to have it run. This problem is purely one of the VFS
> superblock shrinker where we don't directly control it's startup or
> access to the filesystem.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes
  2018-03-20  5:00 ` [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes Dave Chinner
  2018-03-20 12:08   ` Brian Foster
@ 2018-03-21 17:25   ` Darrick J. Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2018-03-21 17:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 20, 2018 at 04:00:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We recently came across an oops on a 4.14 kernel in
> xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> and so the m_perag_tree lookup walked into lala land.
> 
> We found a mount in a failed state, blocked on teh shrinker rwsem
> here:
> 
> mount_bdev()
>   deactivate_locked_super()
>     unregister_shrinker()
> 
> Essentially, the machine was under memory pressure when the mount
> was being run, xfs_fs_fill_super() failed after allocating the
> xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> freed the xfs_mount, but the sb->s_fs_info field still pointed to
> the freed memory. Hence when the superblock shrinker then ran
> it fell off the bad pointer.
> 
> Setting sb->s_fs_info to NULL in this case won't solve the problem;
> we'll still crash on a null pointer in xfs_reclaim_inodes_count().
> The problem is that the superblock shrinker is running before the
> filesystem structures it depends on have been fully set up. i.e.
> the shrinker is registered in sget(), before ->fill_super() has been
> called, and the shrinker can call into the filesystem before
> fill_super() does it's setup work.
> 
> The twist here is that we need the sb shrinker to run in the XFS
> mount process because of the memory pressure that log recovery and
> quota check can create. Hence we need to prevent the indoe cache
> scanning from occurring until after we've set up the per-ag
> structures and caches that it depends on. We also need to ensure
> that we clear the flag and sb->s_fs_info on failure so that we
> can do the right thing in all cases in xfs_reclaim_inodes_count().
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.c | 2 ++
>  fs/xfs/xfs_mount.h | 1 +
>  fs/xfs/xfs_super.c | 9 +++++++++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 6f5a5e6764d8..35a263c90fc0 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -155,6 +155,7 @@ xfs_free_perag(
>  	xfs_agnumber_t	agno;
>  	struct xfs_perag *pag;
>  
> +	mp->m_flags &= ~XFS_MOUNT_PERAG_DONE;
>  	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
>  		spin_lock(&mp->m_perag_lock);
>  		pag = radix_tree_delete(&mp->m_perag_tree, agno);
> @@ -244,6 +245,7 @@ xfs_initialize_perag(
>  		*maxagi = index;
>  
>  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> +	mp->m_flags |= XFS_MOUNT_PERAG_DONE;
>  	return 0;
>  
>  out_hash_destroy:
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 1808f56decaa..3cf7309b615d 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -239,6 +239,7 @@ typedef struct xfs_mount {
>  #define XFS_MOUNT_FILESTREAMS	(1ULL << 24)	/* enable the filestreams
>  						   allocator */
>  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
> +#define XFS_MOUNT_PERAG_DONE	(1ULL << 26)	/* perag icache init done */
>  
>  #define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ee26437dc567..d3283e8c2ce2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1755,6 +1755,7 @@ xfs_fs_fill_super(
>   out_close_devices:
>  	xfs_close_devices(mp);
>   out_free_fsname:
> +	sb->s_fs_info = NULL;
>  	xfs_free_fsname(mp);
>  	kfree(mp);
>   out:
> @@ -1800,6 +1801,14 @@ xfs_fs_nr_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> +	/*
> +	 * Don't do anything until the filesystem is fully set up, or in the
> +	 * process of being torn down due to a mount failure.
> +	 */
> +	if (!sb->s_fs_info)
> +		return 0;
> +	if (!(XFS_M(sb)->m_flags & XFS_MOUNT_PERAG_DONE))
> +		return 0;

I wonder, if we apply Brian's patch to kzalloc-and-init the mp, can we
replace XFS_MOUNT_PERAG_DONE flag checking with

	if (radix_tree_empty(&mp->m_perag_tree))
		return 0;

instead?

--D

>  	return xfs_reclaim_inodes_count(XFS_M(sb));
>  }
>  
> -- 
> 2.16.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-03-21 17:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20  5:00 [PATCH 0/2] xfs: fix mount vs shrinker race Dave Chinner
2018-03-20  5:00 ` [PATCH 1/2] xfs: add mount delay debug option Dave Chinner
2018-03-20 12:00   ` Brian Foster
2018-03-20 13:12     ` Dave Chinner
2018-03-20  5:00 ` [PATCH 2/2] xfs: don't shrink the inode cache until after setup completes Dave Chinner
2018-03-20 12:08   ` Brian Foster
2018-03-20 13:21     ` Dave Chinner
2018-03-20 14:21       ` Brian Foster
2018-03-20 22:33         ` Dave Chinner
2018-03-21 11:26           ` Brian Foster
2018-03-21 17:25   ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.