linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] fs: don't scan the inode cache before SB_BORN is set
@ 2018-05-10  4:21 Dave Chinner
  2018-05-10 16:39 ` Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Dave Chinner @ 2018-05-10  4:21 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, Al Viro


From: Dave Chinner <dchinner@redhat.com>

We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.

We found a mount in a failed state, blocked on the shrinker rwsem
here:

mount_bdev()
  deactivate_locked_super()
    unregister_shrinker()

Essentially, the machine was under memory pressure when the mount
was being run, xfs_fs_fill_super() failed after allocating the
xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
freed the xfs_mount, but the sb->s_fs_info field still pointed to
the freed memory. Hence when the superblock shrinker then ran
it fell off the bad pointer.

However, we also saw another manifestation of the same problem - the
shrinker can fall off a bad pointer if it runs before the superblock
is fully set up - a use before initialisation problem. This
typically crashed somewhere in the radix tree manipulations in
this path:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The underlying problem is that the superblock shrinker is running
before the filesystem structures it depends on have been fully set
up. i.e.  the shrinker is registered in sget(), before
->fill_super() has been called, and the shrinker can call into the
filesystem before fill_super() does it's setup work.

Setting sb->s_fs_info to NULL on xfs_mount setup failure only solves
the use-after-free part of the problem - it doesn't solve the
use-before-initialisation part. To solve that we need to check the
SB_BORN flag in super_cache_count().

The SB_BORN flag is not set until ->fs_mount() completes
successfully and trylock_super() won't succeed until it is set.
Hence super_cache_scan() will not run until SB_BORN is set, so it
makes sense to not allow super_cache_scan to run and enter the
filesystem until it is set, too. This prevents the superblock
shrinker from entering the filesystem while it is being set up and
so avoids the use-before-initialisation issue.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
Version 3:
- change the memory barriers to protect the superblock data, not
  the SB_BORN flag.

Version 2:
- convert to use SB_BORN, not SB_ACTIVE
- add memory barriers
- rework comment in super_cache_count()

---
 fs/super.c         | 30 ++++++++++++++++++++++++------
 fs/xfs/xfs_super.c | 11 +++++++++++
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 122c402049a2..4b5b562176d0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -121,13 +121,23 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	/*
-	 * Don't call trylock_super as it is a potential
-	 * scalability bottleneck. The counts could get updated
-	 * between super_cache_count and super_cache_scan anyway.
-	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_shrink_count() and
-	 * s_op->nr_cached_objects().
+	 * We don't call trylock_super() here as it is a scalability bottleneck,
+	 * so we're exposed to partial setup state. The shrinker rwsem does not
+	 * protect filesystem operations backing list_lru_shrink_count() or
+	 * s_op->nr_cached_objects(). Counts can change between
+	 * super_cache_count and super_cache_scan, so we really don't need locks
+	 * here.
+	 *
+	 * However, if we are currently mounting the superblock, the underlying
+	 * filesystem might be in a state of partial construction and hence it
+	 * is dangerous to access it.  trylock_super() uses a SB_BORN check to
+	 * avoid this situation, so do the same here. The memory barrier is
+	 * matched with the one in mount_fs() as we don't hold locks here.
 	 */
+	if (!(sb->s_flags & SB_BORN))
+		return 0;
+	smp_rmb();
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
@@ -1272,6 +1282,14 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 	sb = root->d_sb;
 	BUG_ON(!sb);
 	WARN_ON(!sb->s_bdi);
+
+	/*
+	 * Write barrier is for super_cache_count(). We place it before setting
+	 * SB_BORN as the data dependency between the two functions is the
+	 * superblock structure contents that we just set up, not the SB_BORN
+	 * flag.
+	 */
+	smp_wmb();
 	sb->s_flags |= SB_BORN;
 
 	error = security_sb_kern_mount(sb, flags, secdata);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a523eaeb3f29..005386f1499e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1772,6 +1772,8 @@ xfs_fs_fill_super(
  out_close_devices:
 	xfs_close_devices(mp);
  out_free_fsname:
+	sb->s_fs_info = NULL;
+	sb->s_op = NULL;
 	xfs_free_fsname(mp);
 	kfree(mp);
  out:
@@ -1798,6 +1800,9 @@ xfs_fs_put_super(
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
+
+	sb->s_fs_info = NULL;
+	sb->s_op = NULL;
 	xfs_free_fsname(mp);
 	kfree(mp);
 }
@@ -1817,6 +1822,12 @@ xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
+	/*
+	 * Don't do anything until the filesystem is fully set up, or in the
+	 * process of being torn down due to a mount failure.
+	 */
+	if (!sb->s_fs_info)
+		return 0;
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] fs: don't scan the inode cache before SB_BORN is set
  2018-05-10  4:21 [PATCH v3] fs: don't scan the inode cache before SB_BORN is set Dave Chinner
@ 2018-05-10 16:39 ` Darrick J. Wong
  2018-05-10 19:09 ` Al Viro
  2018-05-11  1:20 ` [PATCH v4] " Dave Chinner
  2 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2018-05-10 16:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, Al Viro

On Thu, May 10, 2018 at 02:21:33PM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> We recently had an oops reported on a 4.14 kernel in
> xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> and so the m_perag_tree lookup walked into lala land.
> 
> We found a mount in a failed state, blocked on the shrinker rwsem
> here:
> 
> mount_bdev()
>   deactivate_locked_super()
>     unregister_shrinker()
> 
> Essentially, the machine was under memory pressure when the mount
> was being run, xfs_fs_fill_super() failed after allocating the
> xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> freed the xfs_mount, but the sb->s_fs_info field still pointed to
> the freed memory. Hence when the superblock shrinker then ran
> it fell off the bad pointer.
> 
> However, we also saw another manifestation of the same problem - the
> shrinker can fall off a bad pointer if it runs before the superblock
> is fully set up - a use before initialisation problem. This
> typically crashed somewhere in the radix tree manipulations in
> this path:
> 
>   radix_tree_gang_lookup_tag+0xc4/0x130
>   xfs_perag_get_tag+0x37/0xf0
>   xfs_reclaim_inodes_count+0x32/0x40
>   xfs_fs_nr_cached_objects+0x11/0x20
>   super_cache_count+0x35/0xc0
>   shrink_slab.part.66+0xb1/0x370
>   shrink_node+0x7e/0x1a0
>   try_to_free_pages+0x199/0x470
>   __alloc_pages_slowpath+0x3a1/0xd20
>   __alloc_pages_nodemask+0x1c3/0x200
>   cache_grow_begin+0x20b/0x2e0
>   fallback_alloc+0x160/0x200
>   kmem_cache_alloc+0x111/0x4e0
> 
> The underlying problem is that the superblock shrinker is running
> before the filesystem structures it depends on have been fully set
> up. i.e.  the shrinker is registered in sget(), before
> ->fill_super() has been called, and the shrinker can call into the
> filesystem before fill_super() does it's setup work.
> 
> Setting sb->s_fs_info to NULL on xfs_mount setup failure only solves
> the use-after-free part of the problem - it doesn't solve the
> use-before-initialisation part. To solve that we need to check the
> SB_BORN flag in super_cache_count().
> 
> The SB_BORN flag is not set until ->fs_mount() completes
> successfully and trylock_super() won't succeed until it is set.
> Hence super_cache_scan() will not run until SB_BORN is set, so it
> makes sense to not allow super_cache_scan to run and enter the
> filesystem until it is set, too. This prevents the superblock
> shrinker from entering the filesystem while it is being set up and
> so avoids the use-before-initialisation issue.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

Looks ok, will give it a spin,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> Version 3:
> - change the memory barriers to protect the superblock data, not
>   the SB_BORN flag.
> 
> Version 2:
> - convert to use SB_BORN, not SB_ACTIVE
> - add memory barriers
> - rework comment in super_cache_count()
> 
> ---
>  fs/super.c         | 30 ++++++++++++++++++++++++------
>  fs/xfs/xfs_super.c | 11 +++++++++++
>  2 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 122c402049a2..4b5b562176d0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -121,13 +121,23 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	sb = container_of(shrink, struct super_block, s_shrink);
>  
>  	/*
> -	 * Don't call trylock_super as it is a potential
> -	 * scalability bottleneck. The counts could get updated
> -	 * between super_cache_count and super_cache_scan anyway.
> -	 * Call to super_cache_count with shrinker_rwsem held
> -	 * ensures the safety of call to list_lru_shrink_count() and
> -	 * s_op->nr_cached_objects().
> +	 * We don't call trylock_super() here as it is a scalability bottleneck,
> +	 * so we're exposed to partial setup state. The shrinker rwsem does not
> +	 * protect filesystem operations backing list_lru_shrink_count() or
> +	 * s_op->nr_cached_objects(). Counts can change between
> +	 * super_cache_count and super_cache_scan, so we really don't need locks
> +	 * here.
> +	 *
> +	 * However, if we are currently mounting the superblock, the underlying
> +	 * filesystem might be in a state of partial construction and hence it
> +	 * is dangerous to access it.  trylock_super() uses a SB_BORN check to
> +	 * avoid this situation, so do the same here. The memory barrier is
> +	 * matched with the one in mount_fs() as we don't hold locks here.
>  	 */
> +	if (!(sb->s_flags & SB_BORN))
> +		return 0;
> +	smp_rmb();
> +
>  	if (sb->s_op && sb->s_op->nr_cached_objects)
>  		total_objects = sb->s_op->nr_cached_objects(sb, sc);
>  
> @@ -1272,6 +1282,14 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
>  	sb = root->d_sb;
>  	BUG_ON(!sb);
>  	WARN_ON(!sb->s_bdi);
> +
> +	/*
> +	 * Write barrier is for super_cache_count(). We place it before setting
> +	 * SB_BORN as the data dependency between the two functions is the
> +	 * superblock structure contents that we just set up, not the SB_BORN
> +	 * flag.
> +	 */
> +	smp_wmb();
>  	sb->s_flags |= SB_BORN;
>  
>  	error = security_sb_kern_mount(sb, flags, secdata);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index a523eaeb3f29..005386f1499e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1772,6 +1772,8 @@ xfs_fs_fill_super(
>   out_close_devices:
>  	xfs_close_devices(mp);
>   out_free_fsname:
> +	sb->s_fs_info = NULL;
> +	sb->s_op = NULL;
>  	xfs_free_fsname(mp);
>  	kfree(mp);
>   out:
> @@ -1798,6 +1800,9 @@ xfs_fs_put_super(
>  	xfs_destroy_percpu_counters(mp);
>  	xfs_destroy_mount_workqueues(mp);
>  	xfs_close_devices(mp);
> +
> +	sb->s_fs_info = NULL;
> +	sb->s_op = NULL;
>  	xfs_free_fsname(mp);
>  	kfree(mp);
>  }
> @@ -1817,6 +1822,12 @@ xfs_fs_nr_cached_objects(
>  	struct super_block	*sb,
>  	struct shrink_control	*sc)
>  {
> +	/*
> +	 * Don't do anything until the filesystem is fully set up, or in the
> +	 * process of being torn down due to a mount failure.
> +	 */
> +	if (!sb->s_fs_info)
> +		return 0;
>  	return xfs_reclaim_inodes_count(XFS_M(sb));
>  }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] fs: don't scan the inode cache before SB_BORN is set
  2018-05-10  4:21 [PATCH v3] fs: don't scan the inode cache before SB_BORN is set Dave Chinner
  2018-05-10 16:39 ` Darrick J. Wong
@ 2018-05-10 19:09 ` Al Viro
  2018-05-10 23:55   ` Dave Chinner
  2018-05-11  1:20 ` [PATCH v4] " Dave Chinner
  2 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2018-05-10 19:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, May 10, 2018 at 02:21:33PM +1000, Dave Chinner wrote:

> Setting sb->s_fs_info to NULL on xfs_mount setup failure only solves
> the use-after-free part of the problem - it doesn't solve the
> use-before-initialisation part. To solve that we need to check the
> SB_BORN flag in super_cache_count().
> 
> The SB_BORN flag is not set until ->fs_mount() completes
> successfully and trylock_super() won't succeed until it is set.
> Hence super_cache_scan() will not run until SB_BORN is set, so it
> makes sense to not allow super_cache_scan to run and enter the
> filesystem until it is set, too. This prevents the superblock
> shrinker from entering the filesystem while it is being set up and
> so avoids the use-before-initialisation issue.

I'm fine with the first part of that (fs/super.c, that is), but
I don't understand why do you need the xfs side of the patch with
that.  Confused...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3] fs: don't scan the inode cache before SB_BORN is set
  2018-05-10 19:09 ` Al Viro
@ 2018-05-10 23:55   ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2018-05-10 23:55 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-xfs, linux-fsdevel

On Thu, May 10, 2018 at 08:09:16PM +0100, Al Viro wrote:
> On Thu, May 10, 2018 at 02:21:33PM +1000, Dave Chinner wrote:
> 
> > Setting sb->s_fs_info to NULL on xfs_mount setup failure only solves
> > the use-after-free part of the problem - it doesn't solve the
> > use-before-initialisation part. To solve that we need to check the
> > SB_BORN flag in super_cache_count().
> > 
> > The SB_BORN flag is not set until ->fs_mount() completes
> > successfully and trylock_super() won't succeed until it is set.
> > Hence super_cache_scan() will not run until SB_BORN is set, so it
> > makes sense to not allow super_cache_scan to run and enter the
> > filesystem until it is set, too. This prevents the superblock
> > shrinker from entering the filesystem while it is being set up and
> > so avoids the use-before-initialisation issue.
> 
> I'm fine with the first part of that (fs/super.c, that is), but
> I don't understand why do you need the xfs side of the patch with
> that.  Confused...

Well, the zeroing of s_fs_info is necessary to prevent
use-after-free on the ->put_super side when ->fill_super fails. The
shrinker counter check for a null s_fs_info is paranoia -
historically we've had so many problems over time with superblock
shrinkers being run when they shouldn't be, and I'm certain that
someone will get it wrong again in the future.

That said, I got no idea what I was thinking with the s_op change;
that should just go bang on a mount failure because
generic_shutdown_super() calls ops->put_super() unconditionally.
I'll respin and remove the sop changes, and replace it with an
actual check in xfs_fs_put_super() for a null s_fs_info....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v4] fs: don't scan the inode cache before SB_BORN is set
  2018-05-10  4:21 [PATCH v3] fs: don't scan the inode cache before SB_BORN is set Dave Chinner
  2018-05-10 16:39 ` Darrick J. Wong
  2018-05-10 19:09 ` Al Viro
@ 2018-05-11  1:20 ` Dave Chinner
  2018-05-11  2:28   ` Al Viro
  2 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2018-05-11  1:20 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, Al Viro


From: Dave Chinner <dchinner@redhat.com>

We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.  It produces
an oops down this path during the failed mount:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does it's setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.

To fix this, add a check for the SB_BORN flag in super_cache_count.
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will not allow the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
Version 4:
- dropped all the XFS bits from the patch as they are unrelated to
  the shrinker running before the filesystem is fully set up.

Version 3:
- change the memory barriers to protect the superblock data, not
  the SB_BORN flag.

Version 2:
- convert to use SB_BORN, not SB_ACTIVE
- add memory barriers
- rework comment in super_cache_count()

---
 fs/super.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 122c402049a2..4b5b562176d0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -121,13 +121,23 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	/*
-	 * Don't call trylock_super as it is a potential
-	 * scalability bottleneck. The counts could get updated
-	 * between super_cache_count and super_cache_scan anyway.
-	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_shrink_count() and
-	 * s_op->nr_cached_objects().
+	 * We don't call trylock_super() here as it is a scalability bottleneck,
+	 * so we're exposed to partial setup state. The shrinker rwsem does not
+	 * protect filesystem operations backing list_lru_shrink_count() or
+	 * s_op->nr_cached_objects(). Counts can change between
+	 * super_cache_count and super_cache_scan, so we really don't need locks
+	 * here.
+	 *
+	 * However, if we are currently mounting the superblock, the underlying
+	 * filesystem might be in a state of partial construction and hence it
+	 * is dangerous to access it.  trylock_super() uses a SB_BORN check to
+	 * avoid this situation, so do the same here. The memory barrier is
+	 * matched with the one in mount_fs() as we don't hold locks here.
 	 */
+	if (!(sb->s_flags & SB_BORN))
+		return 0;
+	smp_rmb();
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
@@ -1272,6 +1282,14 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
 	sb = root->d_sb;
 	BUG_ON(!sb);
 	WARN_ON(!sb->s_bdi);
+
+	/*
+	 * Write barrier is for super_cache_count(). We place it before setting
+	 * SB_BORN as the data dependency between the two functions is the
+	 * superblock structure contents that we just set up, not the SB_BORN
+	 * flag.
+	 */
+	smp_wmb();
 	sb->s_flags |= SB_BORN;
 
 	error = security_sb_kern_mount(sb, flags, secdata);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] fs: don't scan the inode cache before SB_BORN is set
  2018-05-11  1:20 ` [PATCH v4] " Dave Chinner
@ 2018-05-11  2:28   ` Al Viro
  2018-05-11  3:04     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2018-05-11  2:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, May 11, 2018 at 11:20:57AM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> We recently had an oops reported on a 4.14 kernel in
> xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> and so the m_perag_tree lookup walked into lala land.  It produces
> an oops down this path during the failed mount:
> 
>   radix_tree_gang_lookup_tag+0xc4/0x130
>   xfs_perag_get_tag+0x37/0xf0
>   xfs_reclaim_inodes_count+0x32/0x40
>   xfs_fs_nr_cached_objects+0x11/0x20
>   super_cache_count+0x35/0xc0
>   shrink_slab.part.66+0xb1/0x370
>   shrink_node+0x7e/0x1a0
>   try_to_free_pages+0x199/0x470
>   __alloc_pages_slowpath+0x3a1/0xd20
>   __alloc_pages_nodemask+0x1c3/0x200
>   cache_grow_begin+0x20b/0x2e0
>   fallback_alloc+0x160/0x200
>   kmem_cache_alloc+0x111/0x4e0
> 
> The problem is that the superblock shrinker is running before the
> filesystem structures it depends on have been fully set up. i.e.
> the shrinker is registered in sget(), before ->fill_super() has been
> called, and the shrinker can call into the filesystem before
> fill_super() does it's setup work. Essentially we are exposed to
> both use-after-free and use-before-initialisation bugs here.
> 
> To fix this, add a check for the SB_BORN flag in super_cache_count.
> In general, this flag is not set until ->fs_mount() completes
> successfully, so we know that it is set after the filesystem
> setup has completed. This matches the trylock_super() behaviour
> which will not let super_cache_scan() run if SB_BORN is not set, and
> hence will not allow the superblock shrinker from entering the
> filesystem while it is being set up or after it has failed setup
> and is being torn down.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

Umm...  Shouldn't that be Cc: stable, or is there something subtle I'm
missing here?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] fs: don't scan the inode cache before SB_BORN is set
  2018-05-11  2:28   ` Al Viro
@ 2018-05-11  3:04     ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2018-05-11  3:04 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-xfs, linux-fsdevel

On Fri, May 11, 2018 at 03:28:41AM +0100, Al Viro wrote:
> On Fri, May 11, 2018 at 11:20:57AM +1000, Dave Chinner wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We recently had an oops reported on a 4.14 kernel in
> > xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> > and so the m_perag_tree lookup walked into lala land.  It produces
> > an oops down this path during the failed mount:
> > 
> >   radix_tree_gang_lookup_tag+0xc4/0x130
> >   xfs_perag_get_tag+0x37/0xf0
> >   xfs_reclaim_inodes_count+0x32/0x40
> >   xfs_fs_nr_cached_objects+0x11/0x20
> >   super_cache_count+0x35/0xc0
> >   shrink_slab.part.66+0xb1/0x370
> >   shrink_node+0x7e/0x1a0
> >   try_to_free_pages+0x199/0x470
> >   __alloc_pages_slowpath+0x3a1/0xd20
> >   __alloc_pages_nodemask+0x1c3/0x200
> >   cache_grow_begin+0x20b/0x2e0
> >   fallback_alloc+0x160/0x200
> >   kmem_cache_alloc+0x111/0x4e0
> > 
> > The problem is that the superblock shrinker is running before the
> > filesystem structures it depends on have been fully set up. i.e.
> > the shrinker is registered in sget(), before ->fill_super() has been
> > called, and the shrinker can call into the filesystem before
> > fill_super() does it's setup work. Essentially we are exposed to
> > both use-after-free and use-before-initialisation bugs here.
> > 
> > To fix this, add a check for the SB_BORN flag in super_cache_count.
> > In general, this flag is not set until ->fs_mount() completes
> > successfully, so we know that it is set after the filesystem
> > setup has completed. This matches the trylock_super() behaviour
> > which will not let super_cache_scan() run if SB_BORN is not set, and
> > hence will not allow the superblock shrinker from entering the
> > filesystem while it is being set up or after it has failed setup
> > and is being torn down.
> > 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> 
> Umm...  Shouldn't that be Cc: stable, or is there something subtle I'm
> missing here?

I don't test stable kernels and I know that test coverage of stable
kernels for filesystem changes is utterly abysmal. Hence I will not
push patches on users that are unlikely to be properly tested.

You can add a CC: <stable> when you commit it if you want, but I'm
not going to do that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-05-11  3:04 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-10  4:21 [PATCH v3] fs: don't scan the inode cache before SB_BORN is set Dave Chinner
2018-05-10 16:39 ` Darrick J. Wong
2018-05-10 19:09 ` Al Viro
2018-05-10 23:55   ` Dave Chinner
2018-05-11  1:20 ` [PATCH v4] " Dave Chinner
2018-05-11  2:28   ` Al Viro
2018-05-11  3:04     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).