Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Kirill A. Shutemov"
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Tue, 26 Nov 2019 15:34:40 -0800
Message-ID: <14b7c24b-706e-79cf-6fbc-f3c042f30f06@linux.alibaba.com>
In-Reply-To: <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com>
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com> <20191125093611.hlamtyo4hvefwibi@box> <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com> <20191125183350.5gmcln6t3ofszbsy@box> <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com>

On 11/25/19 11:33 AM, Yang Shi wrote:
>
>
> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>
>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>> Currently, when truncating a shmem file, if the range covers only part
>>>>> of a THP (start or end falls in the middle of the THP), the pages just
>>>>> get cleared rather than freed unless the range covers the whole THP.
>>>>> Even if all the subpages are truncated (randomly or sequentially), the
>>>>> THP may still be kept in the page cache.  This might be fine for some
>>>>> use cases which prefer preserving THP.
>>>>>
>>>>> But when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>>> or MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
>>>>> So, when using shmem THP as the memory backend, QEMU inflation doesn't
>>>>> work as expected since it doesn't free memory.  But the inflation use
>>>>> case really needs the memory to be freed.  Anonymous THP is not freed
>>>>> right away either, but it is freed eventually once all subpages are
>>>>> unmapped, whereas shmem THP would still stay in the page cache.
>>>>>
>>>>> To protect the use cases which may prefer preserving THP, introduce a
>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>>>>> the preferred behavior when truncating part of a THP.  This mode only
>>>>> makes sense for tmpfs for the time being.
>>>> We need to clarify interaction with khugepaged. This implementation
>>>> doesn't do anything to prevent khugepaged from collapsing the range back
>>>> to THP just after the split.
>>> Yes, it doesn't. Will clarify this in the commit log.
>> Okay, but I'm not sure that documentation alone will be enough. We need a
>> proper design.
>
> Maybe we could try to hold the inode lock for read during collapse_file().
> shmem fallocate already acquires the inode lock for write, so this should
> be able to synchronize hole punch and khugepaged.  And shmem only needs to
> hold the inode lock for llseek and fallocate; I suppose they are not called
> frequently enough to have an impact on khugepaged.  llseek might be
> frequent, but it should be quite fast.  However, they might get blocked by
> khugepaged.
>
> It sounds safe to hold a rwsem during collapsing THP.
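
Something like the below is what I mean, just as a sketch (completely
untested; the helper name and signature are made up, the point is only the
lock ordering against shmem_fallocate(), which takes the inode lock
exclusive for hole punch):

/*
 * Sketch only, not the actual patch: let the shmem collapse path in
 * khugepaged serialize against hole punch by taking the inode lock
 * shared, since shmem_fallocate() holds it exclusive.
 */
static void collapse_shmem_file(struct address_space *mapping, pgoff_t start)
{
        struct inode *inode = mapping->host;

        /* Don't block khugepaged on a busy inode, just back off. */
        if (!inode_trylock_shared(inode))
                return;

        /*
         * ... existing collapse_file() work: allocate the huge page,
         * migrate the subpages into it, update the page cache ...
         */

        inode_unlock_shared(inode);
}

With the trylock, khugepaged would not sleep behind a long truncate; it
would simply give up on that range and revisit it on a later scan.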
>
> Or we could set VM_NOHUGEPAGE in the shmem inode's flags when punching a
> hole and clear it after the truncate, then check the flag before doing the
> collapse in khugepaged.  khugepaged would not need to hold the inode lock
> during the collapse since the lock could be released once the flag is
> checked.

Looking at the code again, the latter option (checking VM_NOHUGEPAGE)
doesn't make sense; it can't prevent khugepaged from collapsing the THP in
parallel.

>
>>
>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>>                         }
>>>>>                         unlock_page(page);
>>>>>                 }
>>>>> +rescan_split:
>>>>>                 pagevec_remove_exceptionals(&pvec);
>>>>>                 pagevec_release(&pvec);
>>>>> +
>>>>> +                if (split && PageTransCompound(page)) {
>>>>> +                        /* The THP may get freed under us */
>>>>> +                        if (!get_page_unless_zero(compound_head(page)))
>>>>> +                                goto rescan_out;
>>>>> +
>>>>> +                        lock_page(page);
>>>>> +
>>>>> +                        /*
>>>>> +                         * The extra pins from page cache lookup have been
>>>>> +                         * released by pagevec_release().
>>>>> +                         */
>>>>> +                        if (!split_huge_page(page)) {
>>>>> +                                unlock_page(page);
>>>>> +                                put_page(page);
>>>>> +                                /* Re-look up page cache from current index */
>>>>> +                                goto again;
>>>>> +                        }
>>>>> +                        unlock_page(page);
>>>>> +                        put_page(page);
>>>>> +                }
>>>>> +rescan_out:
>>>>>                 index++;
>>>>>         }
>>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>>> page looks very suboptimal.
>>> If I don't drop the pins the THP can't be split.  And there might be more
>>> than one pin from find_get_entries() if I read the code correctly.  For
>>> example, when truncating an 8K range in the middle of a THP, the THP's
>>> refcount would get bumped twice since two subpages would be returned.
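
For completeness, from userspace the QEMU-style use case described in the
commit log would exercise the new mode roughly like this (sketch only;
FALLOC_FL_SPLIT_HPAGE is the flag proposed by this RFC and is not in any
released UAPI header, the value below is just for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Proposed by this RFC, not an existing UAPI flag; value illustrative. */
#ifndef FALLOC_FL_SPLIT_HPAGE
#define FALLOC_FL_SPLIT_HPAGE 0x80
#endif

/*
 * Punch a sub-THP hole in a THP-backed tmpfs file and ask for the huge
 * page to be split so the punched subpages are actually freed.
 */
static int punch_and_split(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE |
                         FALLOC_FL_SPLIT_HPAGE, offset, len);
}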
>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>
>> Current code is buggy. You need to check that the page still belongs to
>> the file after the speculative lookup.
>
> Yes, I missed this point. Thanks for the suggestion.
>
>>
>
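
Something along these lines is what I'll try (sketch only, untested; the
helper is made up and assumes the caller already took its own reference on
the compound head before pagevec_release(), which is what makes
get_page_unless_zero() unnecessary):

/*
 * Sketch, not the actual fix: re-validate after the window in which the
 * page was unlocked and only pinned by us.  The page cannot have been
 * freed (we hold a reference), but it may have been truncated or split.
 */
static bool shmem_try_split_thp(struct address_space *mapping,
                                struct page *page)
{
        struct page *head = compound_head(page);
        bool done = false;

        lock_page(head);

        /* Still belongs to this file and is still huge? */
        if (head->mapping == mapping && PageTransHuge(head)) {
                /* split_huge_page() returns 0 on success. */
                done = !split_huge_page(head);
        }

        unlock_page(head);
        return done;
}

On success the caller would drop its pin and re-walk the page cache from the
current index, as the patch already does after a successful split.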