From: Oleksandr Natalenko <oleksandr@natalenko.name>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: cgel.zte@gmail.com, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, corbet@lwn.net,
xu xin <xu.xin16@zte.com.cn>, Yang Yang <yang.yang29@zte.com.cn>,
Ran Xiaokai <ran.xiaokai@zte.com.cn>,
wangyong <wang.yong12@zte.com.cn>,
Yunkai Zhang <zhang.yunkai@zte.com.cn>,
Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v6] mm/ksm: introduce ksm_force for each process
Date: Fri, 13 May 2022 22:53:16 +0200 [thread overview]
Message-ID: <1817008.tdWV9SEqCh@natalenko.name> (raw)
In-Reply-To: <20220513133210.9dd0a4216bd8baaa1047562c@linux-foundation.org>
Hello.
On pátek 13. května 2022 22:32:10 CEST Andrew Morton wrote:
> On Fri, 13 May 2022 11:51:53 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> > On pátek 13. května 2022 0:37:53 CEST Andrew Morton wrote:
> > > On Tue, 10 May 2022 15:30:36 +0200 Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> > >
> > > > > If ksm_force is set to 1, force all anonymous and 'qualified' VMAs
> > > > > of this mm to be involved in KSM scanning without explicitly calling
> > > > > madvise to mark VMA as MADV_MERGEABLE. But It is effective only when
> > > > > the klob of /sys/kernel/mm/ksm/run is set as 1.
> > > > >
> > > > > If ksm_force is set to 0, cancel the feature of ksm_force of this
> > > > > process (fallback to the default state) and unmerge those merged pages
> > > > > belonging to VMAs which is not madvised as MADV_MERGEABLE of this process,
> > > > > but still leave MADV_MERGEABLE areas merged.
> > > >
> > > > To my best knowledge, last time a forcible KSM was discussed (see threads [1], [2], [3] and probably others) it was concluded that a) procfs was a horrible interface for things like this one; and b) process_madvise() syscall was among the best suggested places to implement this (which would require a more tricky handling from userspace, but still).
> > > >
> > > > So, what changed since that discussion?
> > > >
> > > > P.S. For now I do it via dedicated syscall, but I'm not trying to upstream this approach.
> > >
> > > Why are you patching the kernel with a new syscall rather than using
> > > process_madvise()?
> >
> > Because I'm not sure how to use `process_madvise()` to achieve $subj properly.
> >
> > The objective is to mark all the eligible VMAs of the target task for KSM to consider them for merging.
> >
> > For that, all the eligible VMAs have to be traversed.
> >
> > Given `process_madvise()` has got an iovec API, this means the process that will call `process_madvise()` has to know the list of VMAs of the target process. In order to traverse them in a race-free manner the target task has to be SIGSTOP'ped or frozen, then the list of VMAs has to be obtained, then `process_madvise()` has to be called, and the the target task can continue. This is:
> >
> > a) superfluous (the kernel already knows the list of VMAs of the target tasks, why proxy it through the userspace then?); and
> > b) may induce more latency than needed because the target task has to be stopped to avoid races.
>
> OK. And what happens to new vmas that the target process creates after
> the process_madvise()?
Call `process_madvise()` on them again. And do that again. Regularly, with some intervals. Use some daemon for that (like [1]).
[1] https://gitlab.com/post-factum/uksmd
> > OTOH, IIUC, even if `MADV_MERGEABLE` is allowed for `process_madvise()`,
>
> Is it not?
It is not:
```
1158 static bool
1159 process_madvise_behavior_valid(int behavior)
1160 {
1161 switch (behavior) {
1162 case MADV_COLD:
1163 case MADV_PAGEOUT:
1164 case MADV_WILLNEED:
1165 return true;
1166 default:
1167 return false;
1168 }
1169 }
```
Initially, when `process_madvise()` stuff was being prepared for merging, I tried to enabled it but failed [2], and it was decided [3] to move forward without it.
[2] https://lore.kernel.org/linux-api/34f812b8-df54-eaad-5cf0-335f07da55c6@suse.cz/
[3] https://lore.kernel.org/lkml/20200623085944.cvob63vrv54fo7cs@butterfly.localdomain/
> > I cannot just call it like this:
> >
> > ```
> > iovec.iov_base = 0;
> > iovec.iov_len = ~0ULL;
> > process_madvise(pidfd, &iovec, 1, MADV_MERGEABLE, 0);
> > ```
> >
> > to cover the whole address space because iovec expects total size to be under ssize_t.
> >
> > Or maybe there's no need to cover the whole address space, only the lower half of it?
>
> Call process_madvise() twice, once for each half?
And still get `-ENOMEM`?
```
1191 /*
1192 * If the interval [start,end) covers some unmapped address
1193 * ranges, just ignore them, but return -ENOMEM at the end.
1194 * - different from the way of handling in mlock etc.
1195 */
```
I mean, it probably will work, and probably having the error returned is fine, but generally speaking an error value should hint that something is not being done right.
> > Or maybe there's another way of doing things, and I just look stupid and do not understand how this is supposed to work?..
> >
> > I'm more than happy to read your comments on this.
> >
>
> I see the problem. I do like the simplicity of the ksm_force concept.
> Are there alternative ideas?
I do like it too. I wonder what to do with older concerns [4] [5] regarding presenting such an API.
[4] https://lore.kernel.org/lkml/20190516172452.GA2106@avx2/
[5] https://lore.kernel.org/lkml/20190515145151.GG16651@dhcp22.suse.cz/
--
Oleksandr Natalenko (post-factum)
next prev parent reply other threads:[~2022-05-13 20:53 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-10 12:22 [PATCH v6] mm/ksm: introduce ksm_force for each process cgel.zte
2022-05-10 13:30 ` Oleksandr Natalenko
2022-05-11 3:12 ` CGEL
2022-05-12 22:37 ` Andrew Morton
2022-05-13 9:51 ` Oleksandr Natalenko
2022-05-13 20:32 ` Andrew Morton
2022-05-13 20:53 ` Oleksandr Natalenko [this message]
2022-05-12 21:07 ` Matthew Wilcox
2022-05-13 6:50 ` CGEL
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1817008.tdWV9SEqCh@natalenko.name \
--to=oleksandr@natalenko.name \
--cc=akpm@linux-foundation.org \
--cc=cgel.zte@gmail.com \
--cc=corbet@lwn.net \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ran.xiaokai@zte.com.cn \
--cc=wang.yong12@zte.com.cn \
--cc=willy@infradead.org \
--cc=xu.xin16@zte.com.cn \
--cc=yang.yang29@zte.com.cn \
--cc=zhang.yunkai@zte.com.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).