Subject: Re: [PATCH] mm: khugepaged: fix potential page state corruption
From: Yang Shi
To: "Kirill A. Shutemov"
Cc: kirill.shutemov@linux.intel.com, hughd@google.com, aarcange@redhat.com,
    akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
References: <1584573582-116702-1-git-send-email-yang.shi@linux.alibaba.com>
    <20200319001258.creziw6ffw4jvwl3@box>
    <2cdc734c-c222-4b9d-9114-1762b29dafb4@linux.alibaba.com>
    <20200319104938.vphyajoyz6ob6jtl@box>
Date: Thu, 19 Mar 2020 10:22:46 -0700

On 3/19/20 9:57 AM, Yang Shi wrote:
>
>
> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
>> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>>
>>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>>
>>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>>> When khugepaged collapses anonymous pages, the base pages would be
>>>>>> freed via pagevec or free_page_and_swap_cache().  But, the anonymous
>>>>>> page may be added back to LRU, then it might result in the below
>>>>>> race:
>>>>>>
>>>>>>      CPU A                       CPU B
>>>>>> khugepaged:
>>>>>>     unlock page
>>>>>>     putback_lru_page
>>>>>>       add to lru
>>>>>>                                  page reclaim:
>>>>>>                                    isolate this page
>>>>>>                                    try_to_unmap
>>>>>>     page_remove_rmap <-- corrupt _mapcount
>>>>>>
>>>>>> It looks like nothing would prevent the pages from being isolated
>>>>>> by the reclaimer.
>>>>> Hm. Why should it?
>>>>>
>>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>>> reclaim as it's not part of usual page table tree. Basically
>>>>> try_to_unmap() will never succeed until we give up the _mapcount on
>>>>> khugepaged side.
>>>> I don't quite get it. What does "not part of usual page table tree"
>>>> mean?
>>>>
>>>> How about the case where try_to_unmap() acquires the ptl before
>>>> khugepaged?
>> The page table we are dealing with was detached from the process' page
>> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
>> pte.
>>
>> try_to_unmap() can only reach the ptl if split ptl is disabled
>> (mm->page_table_lock is used), but it still will not be able to reach
>> pte.
>
> Aha, got it. Thanks for explaining. I definitely missed this point.
> Yes, pmdp_collapse_flush() would clear the pmd, then others won't see
> the page table.
>
> However, it looks like vmscan would not stop at try_to_unmap() at all;
> try_to_unmap() would just return true since pmd_present() should
> return false in pvmw. Then it would go all the way down to
> __remove_mapping(), but freezing the page would fail since
> try_to_unmap() doesn't actually drop the refcount from the pte map.
>
> It would not result in any critical problem AFAICT, but it is
> suboptimal and may cause some unnecessary I/O due to swap.

To correct myself: it would not reach __remove_mapping(), since the
refcount check in pageout() would fail.

>
>>
>>>>> I don't see the issue right away.
>>>>>
>>>>>> The other problem is the page's active or unevictable flag might be
>>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>>> So what?
>>>> The flags may leak to page free path then kernel may complain if
>>>> DEBUG_VM is set.
>> Could you elaborate on what codepath you are talking about?
>
> __put_page ->
>     __put_single_page ->
>         free_unref_page ->
>             free_unref_page_prepare ->
>                 free_pcp_prepare ->
>                     free_pages_prepare ->
>                         free_pages_check
>
> This check is only run when DEBUG_VM is enabled.
>
>>
>>>>>> The putback_lru_page() would not clear those two flags if the
>>>>>> pages are released via pagevec, it sounds like nothing prevents
>>>>>> isolating active
>>> Sorry, this is a typo. If the page is freed via pagevec, the active
>>> and unevictable flags would get cleared by page_off_lru() before
>>> freeing.
>>>
>>> But, if the page is freed by free_page_and_swap_cache(), these two
>>> flags are not cleared. But it seems this path is rarely hit; the
>>> pages are freed via pagevec in most cases.
>>>
>>>>>> or unevictable pages.
>>>>> Again, why should it? vmscan is equipped to deal with this.
>>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>>> unevictable pages since it just simply walks page table.
>> Why is it wrong? lru_cache_add() only complains if both flags are set,
>> which shouldn't happen.
>
> Nothing wrong with isolating active or unevictable pages. I just mean
> it seems possible the active or unevictable flag may still be there if
> the page is freed via the free_page_and_swap_cache() path.
>
>>
>>>>>> However I didn't really run into these problems, just in theory
>>>>>> by visual inspection.
>>>>>>
>>>>>> And, it also seems unnecessary to have the pages added back to LRU
>>>>>> again since they are about to be freed when reaching this point.
>>>>>> So, clear the active and unevictable flags, unlock, and drop the
>>>>>> refcount taken by isolation instead of calling putback_lru_page(),
>>>>>> as page cache collapse does.
>>>>> Hm? But we do call putback_lru_page() on the way out. I do not
>>>>> follow.
>>>> It just calls putback_lru_page() at error path, not success path.
>>>> Putting pages back to lru on error path definitely makes sense.
>>>> Here it is the success path.
>> I agree that putting a page on the LRU just before freeing it is
>> suboptimal, but I don't see it as a critical issue.
>
> Yes, given the code analysis above, I agree. If you think the patch
> is a fine micro-optimization, I would like to re-submit it with a
> corrected commit log. Thank you for your time.
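
The flag-leak concern discussed in this thread can be illustrated with a
minimal userspace sketch in C. This is not kernel code. Every name in it
(model_page, MODEL_PG_*, model_free_check, model_release_keep_flags,
model_release_clear_flags) is hypothetical and exists only for
illustration. It assumes, following the statement in the thread, that the
free-time flag check only runs when DEBUG_VM is enabled, and models that
setting as a plain boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_flag {
    MODEL_PG_LOCKED      = 1 << 0,
    MODEL_PG_LRU         = 1 << 1,
    MODEL_PG_ACTIVE      = 1 << 2,
    MODEL_PG_UNEVICTABLE = 1 << 3,
};

/* Assumption: flags that must be clear when a page reaches the free path. */
#define MODEL_FLAGS_CHECK_AT_FREE \
    (MODEL_PG_LOCKED | MODEL_PG_LRU | MODEL_PG_ACTIVE | MODEL_PG_UNEVICTABLE)

struct model_page {
    unsigned int flags;
    int refcount;
};

/*
 * Stand-in for the free-time flag check. Returns true when the model
 * would complain about a bad page state. The debug_vm parameter models
 * the thread's claim that the check only runs with DEBUG_VM enabled.
 */
static bool model_free_check(const struct model_page *page, bool debug_vm)
{
    if (!debug_vm)
        return false;
    return (page->flags & MODEL_FLAGS_CHECK_AT_FREE) != 0;
}

/*
 * Drop the last reference without touching the flags, which is how the
 * thread describes the free_page_and_swap_cache() path.
 */
static bool model_release_keep_flags(struct model_page *page, bool debug_vm)
{
    assert(page->refcount == 1); /* caller holds the only reference */
    page->refcount = 0;
    return model_free_check(page, debug_vm);
}

/*
 * Clear active/unevictable before dropping the last reference, which is
 * what the patch proposes for the collapse success path.
 */
static bool model_release_clear_flags(struct model_page *page, bool debug_vm)
{
    page->flags &= ~(unsigned int)(MODEL_PG_ACTIVE | MODEL_PG_UNEVICTABLE);
    return model_release_keep_flags(page, debug_vm);
}
```

For a page whose active flag is set, model_release_keep_flags(&page, true)
reports a complaint, while model_release_clear_flags(&page, true) does not.
With the debug parameter set to false neither call complains, so under the
assumption above the leaked flags only matter when the debug check is
enabled.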