From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Al Viro <viro@ZenIV.linux.org.uk>
Subject: Re: [PATCH v3] fs: don't scan the inode cache before SB_BORN is set
Date: Thu, 10 May 2018 09:39:14 -0700
Message-ID: <20180510163914.GF11261@magnolia>
In-Reply-To: <20180510042132.GS23861@dastard>

On Thu, May 10, 2018 at 02:21:33PM +1000, Dave Chinner wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> We recently had an oops reported on a 4.14 kernel in
> xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> and so the m_perag_tree lookup walked into lala land.
>
> We found a mount in a failed state, blocked on the shrinker rwsem
> here:
>
>   mount_bdev()
>     deactivate_locked_super()
>       unregister_shrinker()
>
> Essentially, the machine was under memory pressure when the mount
> was being run, xfs_fs_fill_super() failed after allocating the
> xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> freed the xfs_mount, but the sb->s_fs_info field still pointed to
> the freed memory. Hence when the superblock shrinker then ran
> it fell off the bad pointer.
>
> However, we also saw another manifestation of the same problem - the
> shrinker can fall off a bad pointer if it runs before the superblock
> is fully set up - a use before initialisation problem. This
> typically crashed somewhere in the radix tree manipulations in
> this path:
>
> radix_tree_gang_lookup_tag+0xc4/0x130
> xfs_perag_get_tag+0x37/0xf0
> xfs_reclaim_inodes_count+0x32/0x40
> xfs_fs_nr_cached_objects+0x11/0x20
> super_cache_count+0x35/0xc0
> shrink_slab.part.66+0xb1/0x370
> shrink_node+0x7e/0x1a0
> try_to_free_pages+0x199/0x470
> __alloc_pages_slowpath+0x3a1/0xd20
> __alloc_pages_nodemask+0x1c3/0x200
> cache_grow_begin+0x20b/0x2e0
> fallback_alloc+0x160/0x200
> kmem_cache_alloc+0x111/0x4e0
>
> The underlying problem is that the superblock shrinker is running
> before the filesystem structures it depends on have been fully set
> up, i.e. the shrinker is registered in sget(), before
> ->fill_super() has been called, and the shrinker can call into the
> filesystem before fill_super() does its setup work.
>
> Setting sb->s_fs_info to NULL on xfs_mount setup failure only solves
> the use-after-free part of the problem - it doesn't solve the
> use-before-initialisation part. To solve that we need to check the
> SB_BORN flag in super_cache_count().
>
> The SB_BORN flag is not set until ->fs_mount() completes
> successfully and trylock_super() won't succeed until it is set.
> Hence super_cache_scan() will not run until SB_BORN is set, so it
> makes sense to not allow super_cache_count() to run and enter the
> filesystem until it is set, too. This prevents the superblock
> shrinker from entering the filesystem while it is being set up and
> so avoids the use-before-initialisation issue.
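
For anyone following along, here is a minimal userspace sketch of the
publish/check pairing described above. It assumes C11 fences as rough
stand-ins for smp_wmb()/smp_rmb(); every demo_* name is hypothetical and
not part of the patch, and this is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_SB_BORN 0x1u	/* hypothetical stand-in for SB_BORN */

struct demo_fs_info {
	long reclaimable;	/* stand-in for per-filesystem cache state */
};

struct demo_sb {
	atomic_uint flags;		/* stand-in for sb->s_flags */
	struct demo_fs_info *fs_info;	/* stand-in for sb->s_fs_info */
};

/* Mount side: finish setting up the superblock data, then publish it. */
void demo_mount_finish(struct demo_sb *sb, struct demo_fs_info *info)
{
	assert(sb != NULL && info != NULL);
	sb->fs_info = info;
	/* Order the setup stores before the flag store (role of smp_wmb()). */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_or_explicit(&sb->flags, DEMO_SB_BORN,
				 memory_order_relaxed);
}

/* Shrinker side: refuse to look at the superblock until it is born. */
long demo_cache_count(struct demo_sb *sb)
{
	unsigned int flags;

	flags = atomic_load_explicit(&sb->flags, memory_order_relaxed);
	if (!(flags & DEMO_SB_BORN))
		return 0;
	/* Order the flag load before the data loads (role of smp_rmb()). */
	atomic_thread_fence(memory_order_acquire);
	return sb->fs_info->reclaimable;
}
```

Calling demo_cache_count() before demo_mount_finish() returns 0 without
touching fs_info; afterwards it returns whatever the mount side set up.
The fences protect the data behind the flag, not the flag itself, which
is the point of the v3 change.
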
>
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

Looks ok, will give it a spin,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> Version 3:
> - change the memory barriers to protect the superblock data, not
> the SB_BORN flag.
>
> Version 2:
> - convert to use SB_BORN, not SB_ACTIVE
> - add memory barriers
> - rework comment in super_cache_count()
>
> ---
>  fs/super.c         | 30 ++++++++++++++++++++++++------
>  fs/xfs/xfs_super.c | 11 +++++++++++
>  2 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 122c402049a2..4b5b562176d0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -121,13 +121,23 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	sb = container_of(shrink, struct super_block, s_shrink);
>
>  	/*
> -	 * Don't call trylock_super as it is a potential
> -	 * scalability bottleneck. The counts could get updated
> -	 * between super_cache_count and super_cache_scan anyway.
> -	 * Call to super_cache_count with shrinker_rwsem held
> -	 * ensures the safety of call to list_lru_shrink_count() and
> -	 * s_op->nr_cached_objects().
> +	 * We don't call trylock_super() here as it is a scalability bottleneck,
> +	 * so we're exposed to partial setup state. The shrinker rwsem does not
> +	 * protect filesystem operations backing list_lru_shrink_count() or
> +	 * s_op->nr_cached_objects(). Counts can change between
> +	 * super_cache_count and super_cache_scan, so we really don't need locks
> +	 * here.
> +	 *
> +	 * However, if we are currently mounting the superblock, the underlying
> +	 * filesystem might be in a state of partial construction and hence it
> +	 * is dangerous to access it. trylock_super() uses a SB_BORN check to
> +	 * avoid this situation, so do the same here. The memory barrier is
> +	 * matched with the one in mount_fs() as we don't hold locks here.
>  	 */
> +	if (!(sb->s_flags & SB_BORN))
> +		return 0;
> +	smp_rmb();
> +
>  	if (sb->s_op && sb->s_op->nr_cached_objects)
>  		total_objects = sb->s_op->nr_cached_objects(sb, sc);
>
> @@ -1272,6 +1282,14 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
>  	sb = root->d_sb;
>  	BUG_ON(!sb);
>  	WARN_ON(!sb->s_bdi);
> +
> +	/*
> +	 * Write barrier is for super_cache_count(). We place it before setting
> +	 * SB_BORN as the data dependency between the two functions is the
> +	 * superblock structure contents that we just set up, not the SB_BORN
> +	 * flag.
> +	 */
> +	smp_wmb();
>  	sb->s_flags |= SB_BORN;
>
>  	error = security_sb_kern_mount(sb, flags, secdata);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index a523eaeb3f29..005386f1499e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1772,6 +1772,8 @@ xfs_fs_fill_super(
>   out_close_devices:
>  	xfs_close_devices(mp);
>   out_free_fsname:
> +	sb->s_fs_info = NULL;
> +	sb->s_op = NULL;
>  	xfs_free_fsname(mp);
>  	kfree(mp);
>   out:
> @@ -1798,6 +1800,9 @@ xfs_fs_put_super(
>  	xfs_destroy_percpu_counters(mp);
>  	xfs_destroy_mount_workqueues(mp);
>  	xfs_close_devices(mp);
> +
> +	sb->s_fs_info = NULL;
> +	sb->s_op = NULL;
>  	xfs_free_fsname(mp);
>  	kfree(mp);
>  }
> @@ -1817,6 +1822,12 @@ xfs_fs_nr_cached_objects(
>  	struct super_block *sb,
>  	struct shrink_control *sc)
>  {
> +	/*
> +	 * Don't do anything until the filesystem is fully set up, or in the
> +	 * process of being torn down due to a mount failure.
> +	 */
> +	if (!sb->s_fs_info)
> +		return 0;
>  	return xfs_reclaim_inodes_count(XFS_M(sb));
>  }
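
The XFS half of the patch (clear the pointer before freeing, then check it
in the count callback) can be sketched in userspace as below. This is an
illustration under stated assumptions, not the kernel code: the demo_*
names and the `fail` parameter are hypothetical, and the sketch is
single-threaded, so it only shows the pointer lifecycle and not the
concurrency with the shrinker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_mount {
	long reclaimable;	/* stand-in for reclaimable inode count */
};

struct demo_super {
	struct demo_mount *fs_info;	/* stand-in for sb->s_fs_info */
};

/* Count callback: bail out if there is no private data to look at. */
long demo_nr_cached_objects(const struct demo_super *sb)
{
	if (!sb->fs_info)
		return 0;
	return sb->fs_info->reclaimable;
}

/*
 * Fill-super stand-in. On failure, clear the pointer before freeing so
 * that nothing is left dangling for a later demo_nr_cached_objects().
 */
int demo_fill_super(struct demo_super *sb, int fail)
{
	struct demo_mount *mp = calloc(1, sizeof(*mp));

	if (!mp)
		return -1;
	sb->fs_info = mp;

	if (fail) {
		sb->fs_info = NULL;
		free(mp);
		return -1;
	}

	mp->reclaimable = 7;
	return 0;
}

/* Put-super stand-in: same ordering on the normal teardown path. */
void demo_put_super(struct demo_super *sb)
{
	struct demo_mount *mp = sb->fs_info;

	sb->fs_info = NULL;
	free(mp);
}
```

After a failed demo_fill_super(), demo_nr_cached_objects() returns 0
instead of reading freed memory. As the commit message notes, this only
covers the use-after-free case; the use-before-initialisation case still
needs the SB_BORN check.
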
>

Thread overview: 7+ messages
2018-05-10 4:21 [PATCH v3] fs: don't scan the inode cache before SB_BORN is set Dave Chinner
2018-05-10 16:39 ` Darrick J. Wong [this message]
2018-05-10 19:09 ` Al Viro
2018-05-10 23:55 ` Dave Chinner
2018-05-11 1:20 ` [PATCH v4] " Dave Chinner
2018-05-11 2:28 ` Al Viro
2018-05-11 3:04 ` Dave Chinner