All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>,
	"Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>,
	Vineeth Pillai <vpillai@digitalocean.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kelley Nielsen <kelleynnn@gmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Rik van Riel <riel@surriel.com>,
	Huang Ying <ying.huang@intel.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: shmem_recalc_inode: unable to handle kernel NULL pointer dereference
Date: Sun, 7 Apr 2019 23:05:50 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.1904072206030.1769@eggly.anvils> (raw)
In-Reply-To: <56deb587-8cd6-317a-520f-209207468c55@yandex-team.ru>

On Fri, 5 Apr 2019, Konstantin Khlebnikov wrote:
> On 05.04.2019 5:12, Hugh Dickins wrote:
> > Hi Alex, could you please give the patch below a try? It fixes a
> > problem, but I'm not sure that it's your problem - please let us know.
> > 
> > I've not yet written up the commit description, and this should end up
> > as 4/4 in a series fixing several new swapoff issues: I'll wait to post
> > the finished series until heard back from you.
> > 
> > I did first try following the suggestion Konstantin had made back then,
> > for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
> > 
> > But it turned out to be difficult to get right in shmem_unuse(), because
> > of the way that relies on the inode as a cursor in the list - problem
> > when you've acquired an s_active reference, but fail to acquire inode
> > reference, and cannot safely release the s_active reference while still
> > holding the swaplist mutex.
> > 
> > If VFS offered an isgrab(inode), like igrab() but acquiring s_active
> > reference while holding i_lock, that would drop very easily into the
> > current shmem_unuse() as a replacement there for igrab(). But the rest
> > of the world has managed without that for years, so I'm disinclined to
> > add it just for this. And the patch below seems good enough without it.
> > 
> > Thanks,
> > Hugh
> > 
> > ---
> > 
> >   include/linux/shmem_fs.h |    1 +
> >   mm/shmem.c               |   39 ++++++++++++++++++---------------------
> >   2 files changed, 19 insertions(+), 21 deletions(-)
> > 
> > --- 5.1-rc3/include/linux/shmem_fs.h	2019-03-17 16:18:15.181820820 -0700
> > +++ linux/include/linux/shmem_fs.h	2019-04-04 16:18:08.193512968 -0700
> > @@ -21,6 +21,7 @@ struct shmem_inode_info {
> >   	struct list_head	swaplist;	/* chain of maybes on swap */
> >   	struct shared_policy	policy;		/* NUMA memory alloc policy
> > */
> >   	struct simple_xattrs	xattrs;		/* list of xattrs */
> > +	atomic_t		stop_eviction;	/* hold when working on inode
> > */
> >   	struct inode		vfs_inode;
> >   };
> >   --- 5.1-rc3/mm/shmem.c	2019-03-17 16:18:15.701823872 -0700
> > +++ linux/mm/shmem.c	2019-04-04 16:18:08.193512968 -0700
> > @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
> >   			}
> >   			spin_unlock(&sbinfo->shrinklist_lock);
> >   		}
> > -		if (!list_empty(&info->swaplist)) {
> > +		while (!list_empty(&info->swaplist)) {
> > +			/* Wait while shmem_unuse() is scanning this inode...
> > */
> > +			wait_var_event(&info->stop_eviction,
> > +				       !atomic_read(&info->stop_eviction));
> >   			mutex_lock(&shmem_swaplist_mutex);
> >   			list_del_init(&info->swaplist);
> > +			/* ...but beware of the race if we peeked too early
> > */
> > +			if (!atomic_read(&info->stop_eviction))
> > +				list_del_init(&info->swaplist);
> >   			mutex_unlock(&shmem_swaplist_mutex);
> >   		}
> >   	}
> > @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
> >   		unsigned long *fs_pages_to_unuse)
> >   {
> >   	struct shmem_inode_info *info, *next;
> > -	struct inode *inode;
> > -	struct inode *prev_inode = NULL;
> >   	int error = 0;
> >     	if (list_empty(&shmem_swaplist))
> >   		return 0;
> >     	mutex_lock(&shmem_swaplist_mutex);
> > -
> > -	/*
> > -	 * The extra refcount on the inode is necessary to safely dereference
> > -	 * p->next after re-acquiring the lock. New shmem inodes with swap
> > -	 * get added to the end of the list and we will scan them all.
> > -	 */
> >   	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
> >   		if (!info->swapped) {
> >   			list_del_init(&info->swaplist);
> >   			continue;
> >   		}
> > -
> > -		inode = igrab(&info->vfs_inode);
> > -		if (!inode)
> > -			continue;
> > -
> > +		/*
> > +		 * Drop the swaplist mutex while searching the inode for
> > swap;
> > +		 * but before doing so, make sure shmem_evict_inode() will
> > not
> > +		 * remove placeholder inode from swaplist, nor let it be
> > freed
> > +		 * (igrab() would protect from unlink, but not from unmount).
> > +		 */
> > +		atomic_inc(&info->stop_eviction);
> >   		mutex_unlock(&shmem_swaplist_mutex);
> > -		if (prev_inode)
> > -			iput(prev_inode);
> > -		prev_inode = inode;
> This seems too ad hoc solution.

I see what you mean by "ad hoc", but disagree with "too" ad hoc:
it's an appropriate solution, and a general one - I didn't invent it
for this, but for the huge tmpfs recoveries work items four years ago;
just changed the name from "info->recoveries" to "info->stop_eviction"
to let it be generalized to this swapoff case.

I prefer mine, since it simplifies shmem_unuse() (no igrab!), and has
the nice (but admittedly not essential) property of letting swapoff
proceed without delay and without unnecessary locking on unmounting
filesystems and evicting inodes.  (Would I prefer to use the s_umount
technique for my recoveries case? I think not.)

But yours should work too, with a slight change - see comments below,
where I've inlined yours. I'd better get on and post my four fixes
tomorrow, whether or not they fix Alex's case; then if people prefer
yours to my 4/4, yours can be swapped in instead.

> shmem: fix race between shmem_unuse and umount
> 
> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> Function shmem_unuse could race with generic_shutdown_super.
> Inode reference is not enough for preventing umount and freeing superblock.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  mm/shmem.c |   24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b3db3779a30a..2018a9a96bb7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type,
>  	return ret;
>  }
>  
> +static void shmem_synchronize_umount(struct super_block *sb, void *arg)
> +{
> +}
> +

I think this can go away, see below.

>  /*
>   * Read all the shared memory data that resides in the swap
>   * device 'type' back into memory, so the swap device can be
> @@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  	struct shmem_inode_info *info, *next;
>  	struct inode *inode;
>  	struct inode *prev_inode = NULL;
> +	struct super_block *sb;
>  	int error = 0;
>  
>  	if (list_empty(&shmem_swaplist))
> @@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  			continue;
>  		}
>  
> +		/*
> +		 * Lock superblock to prevent umount and freeing it under us.
> +		 * If umount in progress it will free swap enties.
> +		 *
> +		 * Must be done before grabbing inode reference, otherwise
> +		 * generic_shutdown_super() will complain about busy inodes.
> +		 */
> +		sb = info->vfs_inode.i_sb;
> +		if (!trylock_super(sb))

Right, trylock important there.

> +			continue;
> +
>  		inode = igrab(&info->vfs_inode);
> -		if (!inode)
> +		if (!inode) {
> +			up_read(&sb->s_umount);

Yes, that indeed avoids the difficulty I had with when to call
deactivate_super(), that put me off trying to use s_active.

>  			continue;
> +		}
>  
>  		mutex_unlock(&shmem_swaplist_mutex);
>  		if (prev_inode)
> @@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  
>  		error = shmem_unuse_inode(inode, type, frontswap,
>  					  fs_pages_to_unuse);
> +		up_read(&sb->s_umount);

No, not here. I think you have to note prev_sb, and then only
up_read(&prev_sb->s_umount) after each iput(prev_inode): otherwise
there's still a risk of "Self-destruct in 5 seconds", isn't there?

>  		cond_resched();
>  
>  		mutex_lock(&shmem_swaplist_mutex);
> @@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  	if (prev_inode)
>  		iput(prev_inode);
>  
> +	/* Wait for umounts, this grabs s_umount for each superblock. */
> +	iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL);
> +

I guess that's an attempt to compensate for the somewhat unsatisfactory
trylock above (bearing in mind the SWAP_UNUSE_MAX_TRIES 3, but I remove
that in my 2/4). Nice idea, and if it had the effect of never needing to
retry shmem_unuse(), I'd say yes; but since you're still passing over
un-igrab()-able inodes without an equivalent synchronization, I think
this odd iterate_supers_type() just delays swapoff without buying any
guarantee: better just deleted to keep your patch simpler.

>  	return error;
>  }
>  

Thanks,
Hugh

  reply	other threads:[~2019-04-08  6:06 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-24 15:30 shmem_recalc_inode: unable to handle kernel NULL pointer dereference Alex Xu (Hello71)
2019-03-25 22:08 ` Vineeth Pillai
2019-03-25 22:08   ` Vineeth Pillai
2019-03-31 16:15   ` Alex Xu (Hello71)
2019-03-31 19:21     ` Hugh Dickins
2019-03-31 19:21       ` Hugh Dickins
2019-04-03  0:30       ` Hugh Dickins
2019-04-03  0:30         ` Hugh Dickins
2019-04-05  2:12         ` Hugh Dickins
2019-04-05  2:12           ` Hugh Dickins
2019-04-05  9:41           ` Konstantin Khlebnikov
2019-04-08  6:05             ` Hugh Dickins [this message]
2019-04-08  6:05               ` Hugh Dickins
2019-04-08  7:19               ` Konstantin Khlebnikov
2019-04-08 17:26                 ` Hugh Dickins
2019-04-08 17:26                   ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.11.1904072206030.1769@eggly.anvils \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex_y_xu@yahoo.ca \
    --cc=kelleynnn@gmail.com \
    --cc=khlebnikov@yandex-team.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vpillai@digitalocean.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.