Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Kirill A. Shutemov"
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Tue, 26 Nov 2019 15:34:40 -0800
Message-ID: <14b7c24b-706e-79cf-6fbc-f3c042f30f06@linux.alibaba.com>
In-Reply-To: <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com>
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com> <20191125093611.hlamtyo4hvefwibi@box> <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com> <20191125183350.5gmcln6t3ofszbsy@box> <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com>

On 11/25/19 11:33 AM, Yang Shi wrote:
>
>
> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>
>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>> Currently, when truncating a shmem file, if the range covers only part
>>>>> of a THP (start or end falls in the middle of the THP), the pages just
>>>>> get cleared rather than freed unless the range covers the whole THP.
>>>>> Even if all the subpages are truncated (randomly or sequentially), the
>>>>> THP may still be kept in the page cache.  This might be fine for some
>>>>> use cases which prefer preserving THP.
>>>>>
>>>>> But when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>>> or MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
>>>>> So, when using shmem THP as the memory backend, QEMU inflation doesn't
>>>>> work as expected since it doesn't free memory.  But the inflation use
>>>>> case really needs the memory to be freed.  Anonymous THP is not freed
>>>>> right away either, but it is freed eventually once all subpages are
>>>>> unmapped, whereas shmem THP would still stay in the page cache.
>>>>>
>>>>> To protect the use cases which may prefer preserving THP, introduce a
>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>>>>> the preferred behavior when truncating part of a THP.  This mode only
>>>>> makes sense for tmpfs for the time being.
>>>> We need to clarify interaction with khugepaged. This implementation
>>>> doesn't do anything to prevent khugepaged from collapsing the range back
>>>> to THP just after the split.
>>> Yes, it doesn't. Will clarify this in the commit log.
>> Okay, but I'm not sure that documentation alone will be enough. We need a
>> proper design.
>
> Maybe we could try to hold the inode lock for read during collapse_file().
> shmem fallocate already acquires the inode lock for write, so this should
> be able to synchronize hole punch and khugepaged.  And shmem only needs to
> hold the inode lock for llseek and fallocate; I suppose they are not called
> frequently enough to have an impact on khugepaged.  llseek might be
> frequent, but it should be quite fast.  However, they might get blocked by
> khugepaged.
>
> It sounds safe to hold a rwsem during collapsing THP.
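
Something like the below is what I mean, just as a sketch (completely
untested; the helper name and signature are made up, the point is only the
lock ordering against shmem_fallocate(), which takes the inode lock
exclusive for hole punch):

/*
 * Sketch only, not the actual patch: let the shmem collapse path in
 * khugepaged serialize against hole punch by taking the inode lock
 * shared, since shmem_fallocate() holds it exclusive.
 */
static void collapse_shmem_file(struct address_space *mapping, pgoff_t start)
{
        struct inode *inode = mapping->host;

        /* Don't block khugepaged on a busy inode, just back off. */
        if (!inode_trylock_shared(inode))
                return;

        /*
         * ... existing collapse_file() work: allocate the huge page,
         * migrate the subpages into it, update the page cache ...
         */

        inode_unlock_shared(inode);
}

With the trylock, khugepaged would not sleep behind a long truncate; it
would simply give up on that range and revisit it on a later scan.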
>
> Or we could set VM_NOHUGEPAGE in the shmem inode's flags when punching a
> hole and clear it after the truncate, then check the flag before doing the
> collapse in khugepaged.  khugepaged would not need to hold the inode lock
> during the collapse since the lock could be released once the flag is
> checked.

Looking at the code again, the latter option (checking VM_NOHUGEPAGE)
doesn't make sense; it can't prevent khugepaged from collapsing the THP in
parallel.

>
>>
>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>>                         }
>>>>>                         unlock_page(page);
>>>>>                 }
>>>>> +rescan_split:
>>>>>                 pagevec_remove_exceptionals(&pvec);
>>>>>                 pagevec_release(&pvec);
>>>>> +
>>>>> +                if (split && PageTransCompound(page)) {
>>>>> +                        /* The THP may get freed under us */
>>>>> +                        if (!get_page_unless_zero(compound_head(page)))
>>>>> +                                goto rescan_out;
>>>>> +
>>>>> +                        lock_page(page);
>>>>> +
>>>>> +                        /*
>>>>> +                         * The extra pins from page cache lookup have been
>>>>> +                         * released by pagevec_release().
>>>>> +                         */
>>>>> +                        if (!split_huge_page(page)) {
>>>>> +                                unlock_page(page);
>>>>> +                                put_page(page);
>>>>> +                                /* Re-look up page cache from current index */
>>>>> +                                goto again;
>>>>> +                        }
>>>>> +                        unlock_page(page);
>>>>> +                        put_page(page);
>>>>> +                }
>>>>> +rescan_out:
>>>>>                 index++;
>>>>>         }
>>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>>> page looks very suboptimal.
>>> If I don't drop the pins the THP can't be split.  And there might be more
>>> than one pin from find_get_entries() if I read the code correctly.  For
>>> example, when truncating an 8K range in the middle of a THP, the THP's
>>> refcount would get bumped twice since two subpages would be returned.
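
For completeness, from userspace the QEMU-style use case described in the
commit log would exercise the new mode roughly like this (sketch only;
FALLOC_FL_SPLIT_HPAGE is the flag proposed by this RFC and is not in any
released UAPI header, the value below is just for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Proposed by this RFC, not an existing UAPI flag; value illustrative. */
#ifndef FALLOC_FL_SPLIT_HPAGE
#define FALLOC_FL_SPLIT_HPAGE 0x80
#endif

/*
 * Punch a sub-THP hole in a THP-backed tmpfs file and ask for the huge
 * page to be split so the punched subpages are actually freed.
 */
static int punch_and_split(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE |
                         FALLOC_FL_SPLIT_HPAGE, offset, len);
}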
>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>
>> Current code is buggy. You need to check that the page still belongs to
>> the file after the speculative lookup.
>
> Yes, I missed this point. Thanks for the suggestion.
>
>>
>
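
Something along these lines is what I'll try (sketch only, untested; the
helper is made up and assumes the caller already took its own reference on
the compound head before pagevec_release(), which is what makes
get_page_unless_zero() unnecessary):

/*
 * Sketch, not the actual fix: re-validate after the window in which the
 * page was unlocked and only pinned by us.  The page cannot have been
 * freed (we hold a reference), but it may have been truncated or split.
 */
static bool shmem_try_split_thp(struct address_space *mapping,
                                struct page *page)
{
        struct page *head = compound_head(page);
        bool done = false;

        lock_page(head);

        /* Still belongs to this file and is still huge? */
        if (head->mapping == mapping && PageTransHuge(head)) {
                /* split_huge_page() returns 0 on success. */
                done = !split_huge_page(head);
        }

        unlock_page(head);
        return done;
}

On success the caller would drop its pin and re-walk the page cache from the
current index, as the patch already does after a successful split.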