Subject: Re: shmem_recalc_inode: unable to handle kernel NULL pointer dereference
From: Konstantin Khlebnikov
To: Hugh Dickins
Cc: "Alex Xu (Hello71)", Vineeth Pillai, Andrew Morton, Kelley Nielsen,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Rik van Riel,
    Huang Ying, Al Viro
Date: Mon, 8 Apr 2019 10:19:24 +0300
Message-ID: <1b0bc97a-8162-d4df-7187-7636e5934b23@yandex-team.ru>
References: <1553440122.7s759munpm.astroid@alex-desktop.none>
 <1554048843.jjmwlalntd.astroid@alex-desktop.none>
 <56deb587-8cd6-317a-520f-209207468c55@yandex-team.ru>

On 08.04.2019 9:05, Hugh Dickins wrote:
> On Fri, 5 Apr 2019, Konstantin Khlebnikov wrote:
>> On 05.04.2019 5:12, Hugh Dickins wrote:
>>> Hi Alex, could you please give the patch below a try?
>>> It fixes a problem, but I'm not sure that it's your problem - please
>>> let us know.
>>>
>>> I've not yet written up the commit description, and this should end up
>>> as 4/4 in a series fixing several new swapoff issues: I'll wait to post
>>> the finished series until I've heard back from you.
>>>
>>> I did first try following the suggestion Konstantin had made back then,
>>> for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active).
>>>
>>> But it turned out to be difficult to get right in shmem_unuse(), because
>>> of the way it relies on the inode as a cursor in the list - a problem
>>> arises when you've acquired an s_active reference but fail to acquire
>>> the inode reference, and cannot safely release the s_active reference
>>> while still holding the swaplist mutex.
>>>
>>> If VFS offered an isgrab(inode), like igrab() but acquiring s_active
>>> reference while holding i_lock, that would drop very easily into the
>>> current shmem_unuse() as a replacement there for igrab(). But the rest
>>> of the world has managed without that for years, so I'm disinclined to
>>> add it just for this. And the patch below seems good enough without it.
>>>
>>> Thanks,
>>> Hugh
>>>
>>> ---
>>>
>>>  include/linux/shmem_fs.h |  1 +
>>>  mm/shmem.c               | 39 ++++++++++++++++++---------------------
>>>  2 files changed, 19 insertions(+), 21 deletions(-)
>>>
>>> --- 5.1-rc3/include/linux/shmem_fs.h    2019-03-17 16:18:15.181820820 -0700
>>> +++ linux/include/linux/shmem_fs.h      2019-04-04 16:18:08.193512968 -0700
>>> @@ -21,6 +21,7 @@ struct shmem_inode_info {
>>>      struct list_head     swaplist;      /* chain of maybes on swap */
>>>      struct shared_policy policy;        /* NUMA memory alloc policy */
>>>      struct simple_xattrs xattrs;        /* list of xattrs */
>>> +    atomic_t             stop_eviction; /* hold when working on inode */
>>>      struct inode         vfs_inode;
>>>  };
>>> --- 5.1-rc3/mm/shmem.c  2019-03-17 16:18:15.701823872 -0700
>>> +++ linux/mm/shmem.c    2019-04-04 16:18:08.193512968 -0700
>>> @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
>>>          }
>>>          spin_unlock(&sbinfo->shrinklist_lock);
>>>      }
>>> -    if (!list_empty(&info->swaplist)) {
>>> +    while (!list_empty(&info->swaplist)) {
>>> +        /* Wait while shmem_unuse() is scanning this inode... */
>>> +        wait_var_event(&info->stop_eviction,
>>> +                       !atomic_read(&info->stop_eviction));
>>>          mutex_lock(&shmem_swaplist_mutex);
>>> -        list_del_init(&info->swaplist);
>>> +        /* ...but beware of the race if we peeked too early */
>>> +        if (!atomic_read(&info->stop_eviction))
>>> +            list_del_init(&info->swaplist);
>>>          mutex_unlock(&shmem_swaplist_mutex);
>>>      }
>>>  }
>>> @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
>>>                  unsigned long *fs_pages_to_unuse)
>>>  {
>>>      struct shmem_inode_info *info, *next;
>>> -    struct inode *inode;
>>> -    struct inode *prev_inode = NULL;
>>>      int error = 0;
>>>
>>>      if (list_empty(&shmem_swaplist))
>>>          return 0;
>>>
>>>      mutex_lock(&shmem_swaplist_mutex);
>>> -
>>> -    /*
>>> -     * The extra refcount on the inode is necessary to safely dereference
>>> -     * p->next after re-acquiring the lock. New shmem inodes with swap
>>> -     * get added to the end of the list and we will scan them all.
>>> -     */
>>>      list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
>>>          if (!info->swapped) {
>>>              list_del_init(&info->swaplist);
>>>              continue;
>>>          }
>>> -
>>> -        inode = igrab(&info->vfs_inode);
>>> -        if (!inode)
>>> -            continue;
>>> -
>>> +        /*
>>> +         * Drop the swaplist mutex while searching the inode for swap;
>>> +         * but before doing so, make sure shmem_evict_inode() will not
>>> +         * remove placeholder inode from swaplist, nor let it be freed
>>> +         * (igrab() would protect from unlink, but not from unmount).
>>> +         */
>>> +        atomic_inc(&info->stop_eviction);
>>>          mutex_unlock(&shmem_swaplist_mutex);
>>> -        if (prev_inode)
>>> -            iput(prev_inode);
>>> -        prev_inode = inode;
>>
>> This seems like too ad hoc a solution.
>
> I see what you mean by "ad hoc", but disagree with "too" ad hoc:
> it's an appropriate solution, and a general one - I didn't invent it
> for this, but for the huge tmpfs recoveries work items four years ago;
> just changed the name from "info->recoveries" to "info->stop_eviction"
> to let it be generalized to this swapoff case.
>
> I prefer mine, since it simplifies shmem_unuse() (no igrab!), and has
> the nice (but admittedly not essential) property of letting swapoff
> proceed without delay and without unnecessary locking on unmounting
> filesystems and evicting inodes. (Would I prefer to use the s_umount
> technique for my recoveries case? I think not.)
>
> But yours should work too, with a slight change - see comments below,
> where I've inlined yours. I'd better get on and post my four fixes
> tomorrow, whether or not they fix Alex's case; then if people prefer
> yours to my 4/4, yours can be swapped in instead.

Ok. But neither swapoff nor tmpfs umount looks like an operation whose
concurrency must be preserved at any cost.
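
For reference, the shape of the stop_eviction handshake above, spelled
out as a minimal sketch: the decrement-and-wake side is trimmed from the
patch as quoted, so its exact form here is an assumption, not Hugh's
actual 4/4:

    /* swapoff side (shmem_unuse): pin the inode against eviction */
    atomic_inc(&info->stop_eviction);
    mutex_unlock(&shmem_swaplist_mutex);

    error = shmem_unuse_inode(&info->vfs_inode, type, frontswap,
                              fs_pages_to_unuse);
    cond_resched();

    mutex_lock(&shmem_swaplist_mutex);
    /* assumed release: drop the pin and wake any waiting eviction */
    if (atomic_dec_and_test(&info->stop_eviction))
        wake_up_var(&info->stop_eviction);

    /*
     * eviction side (shmem_evict_inode), exactly as in the hunk above:
     * sleep until no shmem_unuse() is working on this inode.
     */
    wait_var_event(&info->stop_eviction,
                   !atomic_read(&info->stop_eviction));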

>> shmem: fix race between shmem_unuse and umount
>>
>> From: Konstantin Khlebnikov
>>
>> Function shmem_unuse() could race with generic_shutdown_super().
>> An inode reference is not enough to prevent umount and freeing of
>> the superblock.
>>
>> Signed-off-by: Konstantin Khlebnikov
>> ---
>>  mm/shmem.c | 24 +++++++++++++++++++++++-
>>  1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index b3db3779a30a..2018a9a96bb7 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1218,6 +1218,10 @@ static int shmem_unuse_inode(struct inode *inode, unsigned int type,
>>      return ret;
>>  }
>>
>> +static void shmem_synchronize_umount(struct super_block *sb, void *arg)
>> +{
>> +}
>> +
>
> I think this can go away, see below.
>
>>  /*
>>   * Read all the shared memory data that resides in the swap
>>   * device 'type' back into memory, so the swap device can be
>> @@ -1229,6 +1233,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>>      struct shmem_inode_info *info, *next;
>>      struct inode *inode;
>>      struct inode *prev_inode = NULL;
>> +    struct super_block *sb;
>>      int error = 0;
>>
>>      if (list_empty(&shmem_swaplist))
>> @@ -1247,9 +1252,22 @@ int shmem_unuse(unsigned int type, bool frontswap,
>>              continue;
>>          }
>>
>> +        /*
>> +         * Lock the superblock to prevent umount freeing it under us.
>> +         * If umount is in progress it will free the swap entries.
>> +         *
>> +         * Must be done before grabbing the inode reference, otherwise
>> +         * generic_shutdown_super() will complain about busy inodes.
>> +         */
>> +        sb = info->vfs_inode.i_sb;
>> +        if (!trylock_super(sb))
>
> Right, the trylock is important there.
>
>> +            continue;
>> +
>>          inode = igrab(&info->vfs_inode);
>> -        if (!inode)
>> +        if (!inode) {
>> +            up_read(&sb->s_umount);
>
> Yes, that indeed avoids the difficulty I had with when to call
> deactivate_super(), that put me off trying to use s_active.
>
>>              continue;
>> +        }
>>
>>          mutex_unlock(&shmem_swaplist_mutex);
>>          if (prev_inode)
>> @@ -1258,6 +1276,7 @@ int shmem_unuse(unsigned int type, bool frontswap,
>>
>>          error = shmem_unuse_inode(inode, type, frontswap,
>>                                    fs_pages_to_unuse);
>> +        up_read(&sb->s_umount);
>
> No, not here. I think you have to note prev_sb, and then only
> up_read(&prev_sb->s_umount) after each iput(prev_inode): otherwise
> there's still a risk of "Self-destruct in 5 seconds", isn't there?

Oh yes. So this code has to replace the sb locking above with this
monster:

    if (sb != info->vfs_inode.i_sb) {
        if (sb)
            up_read(&sb->s_umount);
        sb = NULL;
        if (!trylock_super(info->vfs_inode.i_sb))
            continue;
        sb = info->vfs_inode.i_sb;
    }

Locking shmem_swaplist_mutex under s_umount should be fine.
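
Alternatively, folding Hugh's suggestion straight into the patch above -
note prev_sb next to prev_inode, and release it only after each iput() -
the loop body would look roughly like this (an untested sketch; prev_sb
is a new local that does not exist in the patch):

    if (!trylock_super(sb))     /* umount in progress: skip this inode */
        continue;
    inode = igrab(&info->vfs_inode);
    if (!inode) {
        up_read(&sb->s_umount);
        continue;
    }
    mutex_unlock(&shmem_swaplist_mutex);

    /* put the previous inode before unlocking its superblock */
    if (prev_inode)
        iput(prev_inode);
    if (prev_sb)
        up_read(&prev_sb->s_umount);
    prev_inode = inode;
    prev_sb = sb;

    error = shmem_unuse_inode(inode, type, frontswap,
                              fs_pages_to_unuse);
    cond_resched();
    mutex_lock(&shmem_swaplist_mutex);

and after the loop, the final cleanup in the same order:

    if (prev_inode)
        iput(prev_inode);
    if (prev_sb)
        up_read(&prev_sb->s_umount);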

Also, I'm looking into the idea of treating swapoff like reverse
writeback:

-> iterate over superblocks
-> lock s_umount with a normal down_read
-> iterate over inodes
-> iterate over inode tags
-> ...

The whole code would be more natural that way.

>
>>          cond_resched();
>>
>>          mutex_lock(&shmem_swaplist_mutex);
>> @@ -1272,6 +1291,9 @@ int shmem_unuse(unsigned int type, bool frontswap,
>>      if (prev_inode)
>>          iput(prev_inode);
>>
>> +    /* Wait for umounts, this grabs s_umount for each superblock. */
>> +    iterate_supers_type(&shmem_fs_type, shmem_synchronize_umount, NULL);
>> +
>
> I guess that's an attempt to compensate for the somewhat unsatisfactory
> trylock above (bearing in mind the SWAP_UNUSE_MAX_TRIES 3, but I remove
> that in my 2/4). Nice idea, and if it had the effect of never needing to
> retry shmem_unuse(), I'd say yes; but since you're still passing over
> un-igrab()-able inodes without an equivalent synchronization, I think
> this odd iterate_supers_type() just delays swapoff without buying any
> guarantee: better just deleted to keep your patch simpler.

Yep, a robust algorithm is better than try-3-times-and-give-up =)
(that sort of thing could hide bugs for ages)

I suppose your solution will wait for a wakeup from shmem_evict_inode()?
That should work.

In a more general design this could be something like
__wait_on_freeing_inode(), but with a killable wait.

>
>>      return error;
>>  }
>>
>
> Thanks,
> Hugh