Re: shmem_recalc_inode: unable to handle kernel NULL pointer dereference

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: Hugh Dickins <hughd@google.com>,
	"Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
Cc: Vineeth Pillai <vpillai@digitalocean.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kelley Nielsen <kelleynnn@gmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Rik van Riel <riel@surriel.com>,
	Huang Ying <ying.huang@intel.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: shmem_recalc_inode: unable to handle kernel NULL pointer dereference
Date: Fri, 5 Apr 2019 12:41:15 +0300	[thread overview]
Message-ID: <56deb587-8cd6-317a-520f-209207468c55@yandex-team.ru> (raw)
In-Reply-To: <alpine.LSU.2.11.1904041836030.25100@eggly.anvils>

[-- Attachment #1: Type: text/plain, Size: 9044 bytes --]

On 05.04.2019 5:12, Hugh Dickins wrote:
> On Tue, 2 Apr 2019, Hugh Dickins wrote:
>> On Sun, 31 Mar 2019, Hugh Dickins wrote:
>>> On Sun, 31 Mar 2019, Alex Xu (Hello71) wrote:
>>>> Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm:
>>>>> On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>>>>>>
>>>>>> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I
>>>>>> suspect my setup erroneously executes two swapoff+cryptsetup close
>>>>>> operations simultaneously, so a race condition is triggered.
>>>>>>
>>>>>> I am using a single swap on a plain dm-crypt device on a MBR partition
>>>>>> on a SATA drive.
>>>>>>
>>>>>> I think the problem is probably related to
>>>>>> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people.
>>>>>>
>>>>> Could you please provide more information on this - stack trace, dmesg etc?
>>>>> Is it easily reproducible? If yes, please detail the steps so that I
>>>>> can try it inhouse.
>>>>>
>>>>> Thanks,
>>>>> Vineeth
>>>>>
>>>>
>>>> Some info from the BUG entry (I didn't bother to type it all,
>>>> low-quality image available upon request):
>>>>
>>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>>>> #PF error: [normal kernel read fault]
>>>> PGD 0 P4D 0
>>>> Oops: 0000 [#1] SMP
>>>> CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2
>>>> RIP: 0010:shmem_recalc_inode+0x41/0x90
>>>>
>>>> Call Trace:
>>>> ? shmem_undo_range
>>>> ? rb_erase_cached
>>>> ? set_next_entity
>>>> ? __inode_wait_for_writeback
>>>> ? shmem_truncate_range
>>>> ? shmem_evict_inode
>>>> ? evict
>>>> ? shmem_unuse
>>>> ? try_to_unuse
>>>> ? swapcache_free_entries
>>>> ? _cond_resched
>>>> ? __se_sys_swapoff
>>>> ? do_syscall_64
>>>> ? entry_SYSCALL_64_after_hwframe
>>>>
>>>> As I said, it only occurs occasionally on shutdown. I think it is a safe
>>>> guess that it can only occur when the swap is not empty, but possibly
>>>> other conditions are necessary, so I will test further.
>>>
>>> Thanks for the update, Alex. I'm looking into a couple of bugs with the
>>> 5.1-rc swapoff, but this one doesn't look like anything I know so far.
>>> shmem_recalc_inode() is a surprising place to crash: it's as if the
>>> igrab() in shmem_unuse() were not working.
>>>
>>> Yes, please do send Vineeth and me (or the lists) your low-quality image,
>>> in case we can extract any more info from it; and also please the
>>> disassembly of your kernel's shmem_recalc_inode(), so we can be sure of
>>> exactly what it's crashing on (though I expect that will leave me as
>>> puzzled as before).
>>>
>>> If you want to experiment with one of my fixes, not yet written up and
>>> posted, just try changing SWAP_UNUSE_MAX_TRIES in mm/swapfile.c from
>>> 3 to INT_MAX: I don't see how that issue could manifest as crashing in
>>> shmem_recalc_inode(), but I may just be too stupid to see it.
>>
>> Thanks for the image and disassembly you sent: which showed that the
>> ffffffff81117351:       48 83 3f 00             cmpq   $0x0,(%rdi)
>> you are crashing on, is the "if (sbinfo->max_blocks)" in the inlined
>> shmem_inode_unacct_blocks(): inode->i_sb->s_fs_info is NULL, which is
>> something that shmem_put_super() does.
>>
>> Eight-year-old memories stirred: I knew when looking at Vineeth's patch,
>> that I ought to look back through the history of mm/shmem.c, to check
>> some points that Konstantin Khlebnikov had made years ago, that
>> surprised me then and were in danger of surprising us again with this
>> rework. But I failed to do so: thank you Alex, for reporting this bug
>> and pointing us back there.
>>
>> igrab() protects from eviction but does not protect from unmounting.
>> I bet that is what you are hitting, though I've not even read through
>> 2.6.39's 778dd893ae785 ("tmpfs: fix race between umount and swapoff")
>> again yet, and not begun to think of the fix for it this time around;
>> but wanted to let you know that this bug is now (probably) identified.
> 
> Hi Alex, could you please give the patch below a try? It fixes a
> problem, but I'm not sure that it's your problem - please let us know.
> 
> I've not yet written up the commit description, and this should end up
> as 4/4 in a series fixing several new swapoff issues: I'll wait to post
> the finished series until heard back from you.
> 
> I did first try following the suggestion Konstantin had made back then,
> for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
> 
> But it turned out to be difficult to get right in shmem_unuse(), because
> of the way that relies on the inode as a cursor in the list - problem
> when you've acquired an s_active reference, but fail to acquire inode
> reference, and cannot safely release the s_active reference while still
> holding the swaplist mutex.
> 
> If VFS offered an isgrab(inode), like igrab() but acquiring s_active
> reference while holding i_lock, that would drop very easily into the
> current shmem_unuse() as a replacement there for igrab(). But the rest
> of the world has managed without that for years, so I'm disinclined to
> add it just for this. And the patch below seems good enough without it.
> 
> Thanks,
> Hugh
> 
> ---
> 
>   include/linux/shmem_fs.h |    1 +
>   mm/shmem.c               |   39 ++++++++++++++++++---------------------
>   2 files changed, 19 insertions(+), 21 deletions(-)
> 
> --- 5.1-rc3/include/linux/shmem_fs.h	2019-03-17 16:18:15.181820820 -0700
> +++ linux/include/linux/shmem_fs.h	2019-04-04 16:18:08.193512968 -0700
> @@ -21,6 +21,7 @@ struct shmem_inode_info {
>   	struct list_head	swaplist;	/* chain of maybes on swap */
>   	struct shared_policy	policy;		/* NUMA memory alloc policy */
>   	struct simple_xattrs	xattrs;		/* list of xattrs */
> +	atomic_t		stop_eviction;	/* hold when working on inode */
>   	struct inode		vfs_inode;
>   };
>   
> --- 5.1-rc3/mm/shmem.c	2019-03-17 16:18:15.701823872 -0700
> +++ linux/mm/shmem.c	2019-04-04 16:18:08.193512968 -0700
> @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
>   			}
>   			spin_unlock(&sbinfo->shrinklist_lock);
>   		}
> -		if (!list_empty(&info->swaplist)) {
> +		while (!list_empty(&info->swaplist)) {
> +			/* Wait while shmem_unuse() is scanning this inode... */
> +			wait_var_event(&info->stop_eviction,
> +				       !atomic_read(&info->stop_eviction));
>   			mutex_lock(&shmem_swaplist_mutex);
>   			list_del_init(&info->swaplist);
> +			/* ...but beware of the race if we peeked too early */
> +			if (!atomic_read(&info->stop_eviction))
> +				list_del_init(&info->swaplist);
>   			mutex_unlock(&shmem_swaplist_mutex);
>   		}
>   	}
> @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
>   		unsigned long *fs_pages_to_unuse)
>   {
>   	struct shmem_inode_info *info, *next;
> -	struct inode *inode;
> -	struct inode *prev_inode = NULL;
>   	int error = 0;
>   
>   	if (list_empty(&shmem_swaplist))
>   		return 0;
>   
>   	mutex_lock(&shmem_swaplist_mutex);
> -
> -	/*
> -	 * The extra refcount on the inode is necessary to safely dereference
> -	 * p->next after re-acquiring the lock. New shmem inodes with swap
> -	 * get added to the end of the list and we will scan them all.
> -	 */
>   	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
>   		if (!info->swapped) {
>   			list_del_init(&info->swaplist);
>   			continue;
>   		}
> -
> -		inode = igrab(&info->vfs_inode);
> -		if (!inode)
> -			continue;
> -
> +		/*
> +		 * Drop the swaplist mutex while searching the inode for swap;
> +		 * but before doing so, make sure shmem_evict_inode() will not
> +		 * remove placeholder inode from swaplist, nor let it be freed
> +		 * (igrab() would protect from unlink, but not from unmount).
> +		 */
> +		atomic_inc(&info->stop_eviction);
>   		mutex_unlock(&shmem_swaplist_mutex);
> -		if (prev_inode)
> -			iput(prev_inode);
> -		prev_inode = inode;
This seems too ad hoc solution.

Superblock could be protected with s_umount,
in same way as writeback pins it in __writeback_inodes_wb()

Please see (completely untested) patch in attachment.

>   
> -		error = shmem_unuse_inode(inode, type, frontswap,
> +		error = shmem_unuse_inode(&info->vfs_inode, type, frontswap,
>   					  fs_pages_to_unuse);
>   		cond_resched();
>   
> @@ -1264,14 +1261,13 @@ int shmem_unuse(unsigned int type, bool
>   		next = list_next_entry(info, swaplist);
>   		if (!info->swapped)
>   			list_del_init(&info->swaplist);
> +		if (atomic_dec_and_test(&info->stop_eviction))
> +			wake_up_var(&info->stop_eviction);
>   		if (error)
>   			break;
>   	}
>   	mutex_unlock(&shmem_swaplist_mutex);
>   
> -	if (prev_inode)
> -		iput(prev_inode);
> -
>   	return error;
>   }
>   
> @@ -2238,6 +2234,7 @@ static struct inode *shmem_get_inode(str
>   		info = SHMEM_I(inode);
>   		memset(info, 0, (char *)inode - (char *)info);
>   		spin_lock_init(&info->lock);
> +		atomic_set(&info->stop_eviction, 0);
>   		info->seals = F_SEAL_SEAL;
>   		info->flags = flags & VM_NORESERVE;
>   		INIT_LIST_HEAD(&info->shrinklist);
> 

[-- Attachment #2: shmem-fix-race-between-shmem_unuse-and-umount --]
[-- Type: text/plain, Size: 2212 bytes --]

shmem: fix race between shmem_unuse and umount

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Function shmem_unuse could race with generic_shutdown_super.
Inode reference is not enough for preventing umount and freeing superblock.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/shmem.c |   24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index b3db3779a30a..2018a9a96bb7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type,
 	return ret;
 }
 
+static void shmem_synchronize_umount(struct super_block *sb, void *arg)
+{
+}
+
 /*
  * Read all the shared memory data that resides in the swap
  * device 'type' back into memory, so the swap device can be
@@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
 	struct shmem_inode_info *info, *next;
 	struct inode *inode;
 	struct inode *prev_inode = NULL;
+	struct super_block *sb;
 	int error = 0;
 
 	if (list_empty(&shmem_swaplist))
@@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap,
 			continue;
 		}
 
+		/*
+		 * Lock superblock to prevent umount and freeing it under us.
+		 * If umount in progress it will free swap enties.
+		 *
+		 * Must be done before grabbing inode reference, otherwise
+		 * generic_shutdown_super() will complain about busy inodes.
+		 */
+		sb = info->vfs_inode.i_sb;
+		if (!trylock_super(sb))
+			continue;
+
 		inode = igrab(&info->vfs_inode);
-		if (!inode)
+		if (!inode) {
+			up_read(&sb->s_umount);
 			continue;
+		}
 
 		mutex_unlock(&shmem_swaplist_mutex);
 		if (prev_inode)
@@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
 
 		error = shmem_unuse_inode(inode, type, frontswap,
 					  fs_pages_to_unuse);
+		up_read(&sb->s_umount);
 		cond_resched();
 
 		mutex_lock(&shmem_swaplist_mutex);
@@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap,
 	if (prev_inode)
 		iput(prev_inode);
 
+	/* Wait for umounts, this grabs s_umount for each superblock. */
+	iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL);
+
 	return error;
 }