Subject: Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
From: ning zhang <ningzhang@linux.alibaba.com>
To: Michal Hocko
Cc: linux-mm@kvack.org, Andrew Morton, Johannes Weiner, Vladimir Davydov, Yu Zhao
Date: Mon, 8 Nov 2021 11:24:22 +0800
Message-ID: <9f1e3299-90d6-8a9b-1705-a31a97317644@linux.alibaba.com>
References: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com>

On 2021/11/1 5:20 PM, Michal Hocko wrote:
> On Sat 30-10-21 00:12:53, ning zhang wrote:
>> On 2021/10/29 9:38 PM, Michal Hocko wrote:
>>> On Thu 28-10-21 19:56:49, Ning Zhang wrote:
>>>> As we know, THP may lead to memory bloat, which may cause OOM.
>>>> Testing with some applications, we found that the cause of the
>>>> memory bloat is that a huge page may contain some zero subpages
>>>> (whether accessed or not), and that most zero subpages are
>>>> concentrated in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for TensorFlow:
>>>>
>>>>    zero_subpages   huge_pages  waste
>>>>    [     0,     1) 186          0.00%
>>>>    [     1,     2)  23          0.01%
>>>>    [     2,     4)  36          0.02%
>>>>    [     4,     8)  67          0.08%
>>>>    [     8,    16)  80          0.23%
>>>>    [    16,    32) 109          0.61%
>>>>    [    32,    64)  44          0.49%
>>>>    [    64,   128)  12          0.30%
>>>>    [   128,   256)  28          1.54%
>>>>    [   256,   513) 159         18.03%
>>>>
>>>> In this case, there are 187 huge pages (25% of the total huge pages)
>>>> which contain more than 128 zero subpages, and these huge pages
>>>> account for 19.57% waste of the total RSS. That means we could
>>>> reclaim 19.57% of memory by splitting those 187 huge pages and
>>>> reclaiming their zero subpages.
>>> What is the THP policy configuration in your testing? I assume you
>>> are using the defaults, right? That would be always for THP and
>>> madvise for defrag. Would it make more sense to use madvise mode for
>>> THP for your workload? The THP code is rather complex, and just by
>>> looking at the diffstat this adds quite a lot on top. Is this really
>>> worth it?
>> The THP configuration is always.
>>
>> Madvise requires users to set MADV_HUGEPAGE themselves if they want
>> to use huge pages; many users don't set this, and they can't control
>> it well.
> What do you mean they can't control this well?

I mean they don't know where they should use THP. And even if they use
madvise, memory bloat still exists.

>
>> For example, with Java, users can set the heap and metaspace to use
>> huge pages with madvise, but there is still memory bloat. Users still
>> need to test whether their application can accept the waste.
> There will always be some internal fragmentation when huge pages are
> used. The amount will depend on how well the memory is used, but huge
> pages give a performance boost in return.
>
> If the memory bloat is a significant problem, then overeager THP usage
> is certainly not good, and I would argue that applying the THP always
> policy is not a proper configuration. No matter how much the MM code
> tries to fix up the situation, it will always be a catch-up game.
>
>> For the case above, if we set the THP configuration to madvise, all
>> the pages it uses will be 4K pages.
>>
>> Memory bloat is one of the most important reasons that users disable
>> THP. We do this so that THP can be enabled by default.
> To my knowledge the most popular reason to disable THP is the runtime
> overhead. A large part of that overhead has been reduced by not doing
> heavy compaction during page fault allocations by default. Memory
> overhead is certainly an important aspect as well, but there is always
> the possibility of reducing it by restricting THP to madvised regions
> for page faults (i.e. those where the author of the code has considered
> the costs vs. benefits of huge pages) and setting up a conservative
> khugepaged policy. So there are existing tools available. You are
> trying to add quite a lot of code, so you should have good arguments to
> add more complexity. I am not sure that popularizing THP is a strong
> one, TBH.

Sorry for replying late. As for compaction, we can set the defrag mode
of THP to defer or never, to avoid the overhead of synchronous
compaction and reclaim at fault time. However, there is no way to reduce
memory bloat. If memory usage reaches the limit and we can't reclaim any
pages, the OOM killer will be triggered and the process will be killed.
Our patchset is meant to avoid that OOM.

Much of the code is the interface to control ZSR, and we will try to
reduce the complexity.