LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>,
	"Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>,
	Vineeth Pillai <vpillai@digitalocean.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kelley Nielsen <kelleynnn@gmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Rik van Riel <riel@surriel.com>,
	Huang Ying <ying.huang@intel.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: shmem_recalc_inode: unable to handle kernel NULL pointer dereference
Date: Sun, 7 Apr 2019 23:05:50 -0700 (PDT)
Message-ID: <alpine.LSU.2.11.1904072206030.1769@eggly.anvils> (raw)
In-Reply-To: <56deb587-8cd6-317a-520f-209207468c55@yandex-team.ru>

On Fri, 5 Apr 2019, Konstantin Khlebnikov wrote:
> On 05.04.2019 5:12, Hugh Dickins wrote:
> > Hi Alex, could you please give the patch below a try? It fixes a
> > problem, but I'm not sure that it's your problem - please let us know.
> > 
> > I've not yet written up the commit description, and this should end up
> > as 4/4 in a series fixing several new swapoff issues: I'll wait to post
> > the finished series until heard back from you.
> > 
> > I did first try following the suggestion Konstantin had made back then,
> > for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
> > 
> > But it turned out to be difficult to get right in shmem_unuse(), because
> > of the way that relies on the inode as a cursor in the list - problem
> > when you've acquired an s_active reference, but fail to acquire inode
> > reference, and cannot safely release the s_active reference while still
> > holding the swaplist mutex.
> > 
> > If VFS offered an isgrab(inode), like igrab() but acquiring s_active
> > reference while holding i_lock, that would drop very easily into the
> > current shmem_unuse() as a replacement there for igrab(). But the rest
> > of the world has managed without that for years, so I'm disinclined to
> > add it just for this. And the patch below seems good enough without it.
> > 
> > Thanks,
> > Hugh
> > 
> > ---
> > 
> >   include/linux/shmem_fs.h |    1 +
> >   mm/shmem.c               |   39 ++++++++++++++++++---------------------
> >   2 files changed, 19 insertions(+), 21 deletions(-)
> > 
> > --- 5.1-rc3/include/linux/shmem_fs.h	2019-03-17 16:18:15.181820820 -0700
> > +++ linux/include/linux/shmem_fs.h	2019-04-04 16:18:08.193512968 -0700
> > @@ -21,6 +21,7 @@ struct shmem_inode_info {
> >   	struct list_head	swaplist;	/* chain of maybes on swap */
> >   	struct shared_policy	policy;		/* NUMA memory alloc policy
> > */
> >   	struct simple_xattrs	xattrs;		/* list of xattrs */
> > +	atomic_t		stop_eviction;	/* hold when working on inode
> > */
> >   	struct inode		vfs_inode;
> >   };
> >   --- 5.1-rc3/mm/shmem.c	2019-03-17 16:18:15.701823872 -0700
> > +++ linux/mm/shmem.c	2019-04-04 16:18:08.193512968 -0700
> > @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
> >   			}
> >   			spin_unlock(&sbinfo->shrinklist_lock);
> >   		}
> > -		if (!list_empty(&info->swaplist)) {
> > +		while (!list_empty(&info->swaplist)) {
> > +			/* Wait while shmem_unuse() is scanning this inode...
> > */
> > +			wait_var_event(&info->stop_eviction,
> > +				       !atomic_read(&info->stop_eviction));
> >   			mutex_lock(&shmem_swaplist_mutex);
> >   			list_del_init(&info->swaplist);
> > +			/* ...but beware of the race if we peeked too early
> > */
> > +			if (!atomic_read(&info->stop_eviction))
> > +				list_del_init(&info->swaplist);
> >   			mutex_unlock(&shmem_swaplist_mutex);
> >   		}
> >   	}
> > @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
> >   		unsigned long *fs_pages_to_unuse)
> >   {
> >   	struct shmem_inode_info *info, *next;
> > -	struct inode *inode;
> > -	struct inode *prev_inode = NULL;
> >   	int error = 0;
> >     	if (list_empty(&shmem_swaplist))
> >   		return 0;
> >     	mutex_lock(&shmem_swaplist_mutex);
> > -
> > -	/*
> > -	 * The extra refcount on the inode is necessary to safely dereference
> > -	 * p->next after re-acquiring the lock. New shmem inodes with swap
> > -	 * get added to the end of the list and we will scan them all.
> > -	 */
> >   	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
> >   		if (!info->swapped) {
> >   			list_del_init(&info->swaplist);
> >   			continue;
> >   		}
> > -
> > -		inode = igrab(&info->vfs_inode);
> > -		if (!inode)
> > -			continue;
> > -
> > +		/*
> > +		 * Drop the swaplist mutex while searching the inode for
> > swap;
> > +		 * but before doing so, make sure shmem_evict_inode() will
> > not
> > +		 * remove placeholder inode from swaplist, nor let it be
> > freed
> > +		 * (igrab() would protect from unlink, but not from unmount).
> > +		 */
> > +		atomic_inc(&info->stop_eviction);
> >   		mutex_unlock(&shmem_swaplist_mutex);
> > -		if (prev_inode)
> > -			iput(prev_inode);
> > -		prev_inode = inode;
> This seems too ad hoc solution.

I see what you mean by "ad hoc", but disagree with "too" ad hoc:
it's an appropriate solution, and a general one - I didn't invent it
for this, but for the huge tmpfs recoveries work items four years ago;
just changed the name from "info->recoveries" to "info->stop_eviction"
to let it be generalized to this swapoff case.

I prefer mine, since it simplifies shmem_unuse() (no igrab!), and has
the nice (but admittedly not essential) property of letting swapoff
proceed without delay and without unnecessary locking on unmounting
filesystems and evicting inodes.  (Would I prefer to use the s_umount
technique for my recoveries case? I think not.)

But yours should work too, with a slight change - see comments below,
where I've inlined yours. I'd better get on and post my four fixes
tomorrow, whether or not they fix Alex's case; then if people prefer
yours to my 4/4, yours can be swapped in instead.

> shmem: fix race between shmem_unuse and umount
> 
> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> Function shmem_unuse could race with generic_shutdown_super.
> Inode reference is not enough for preventing umount and freeing superblock.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  mm/shmem.c |   24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b3db3779a30a..2018a9a96bb7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type,
>  	return ret;
>  }
>  
> +static void shmem_synchronize_umount(struct super_block *sb, void *arg)
> +{
> +}
> +

I think this can go away, see below.

>  /*
>   * Read all the shared memory data that resides in the swap
>   * device 'type' back into memory, so the swap device can be
> @@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  	struct shmem_inode_info *info, *next;
>  	struct inode *inode;
>  	struct inode *prev_inode = NULL;
> +	struct super_block *sb;
>  	int error = 0;
>  
>  	if (list_empty(&shmem_swaplist))
> @@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  			continue;
>  		}
>  
> +		/*
> +		 * Lock superblock to prevent umount and freeing it under us.
> +		 * If umount in progress it will free swap enties.
> +		 *
> +		 * Must be done before grabbing inode reference, otherwise
> +		 * generic_shutdown_super() will complain about busy inodes.
> +		 */
> +		sb = info->vfs_inode.i_sb;
> +		if (!trylock_super(sb))

Right, trylock important there.

> +			continue;
> +
>  		inode = igrab(&info->vfs_inode);
> -		if (!inode)
> +		if (!inode) {
> +			up_read(&sb->s_umount);

Yes, that indeed avoids the difficulty I had with when to call
deactivate_super(), that put me off trying to use s_active.

>  			continue;
> +		}
>  
>  		mutex_unlock(&shmem_swaplist_mutex);
>  		if (prev_inode)
> @@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  
>  		error = shmem_unuse_inode(inode, type, frontswap,
>  					  fs_pages_to_unuse);
> +		up_read(&sb->s_umount);

No, not here. I think you have to note prev_sb, and then only
up_read(&prev_sb->s_umount) after each iput(prev_inode): otherwise
there's still a risk of "Self-destruct in 5 seconds", isn't there?

>  		cond_resched();
>  
>  		mutex_lock(&shmem_swaplist_mutex);
> @@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap,
>  	if (prev_inode)
>  		iput(prev_inode);
>  
> +	/* Wait for umounts, this grabs s_umount for each superblock. */
> +	iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL);
> +

I guess that's an attempt to compensate for the somewhat unsatisfactory
trylock above (bearing in mind the SWAP_UNUSE_MAX_TRIES 3, but I remove
that in my 2/4). Nice idea, and if it had the effect of never needing to
retry shmem_unuse(), I'd say yes; but since you're still passing over
un-igrab()-able inodes without an equivalent synchronization, I think
this odd iterate_supers_type() just delays swapoff without buying any
guarantee: better just deleted to keep your patch simpler.

>  	return error;
>  }
>  

Thanks,
Hugh

  reply index

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-24 15:30 Alex Xu (Hello71)
2019-03-25 22:08 ` Vineeth Pillai
2019-03-31 16:15   ` Alex Xu (Hello71)
2019-03-31 19:21     ` Hugh Dickins
2019-04-03  0:30       ` Hugh Dickins
2019-04-05  2:12         ` Hugh Dickins
2019-04-05  9:41           ` Konstantin Khlebnikov
2019-04-08  6:05             ` Hugh Dickins [this message]
2019-04-08  7:19               ` Konstantin Khlebnikov
2019-04-08 17:26                 ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.11.1904072206030.1769@eggly.anvils \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex_y_xu@yahoo.ca \
    --cc=kelleynnn@gmail.com \
    --cc=khlebnikov@yandex-team.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vpillai@digitalocean.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git