* [RFC]: userspace memory reaping
@ 2020-09-15  0:43 Suren Baghdasaryan
  2020-09-15  0:45 ` Suren Baghdasaryan
  0 siblings, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-09-15  0:43 UTC (permalink / raw)
  To: linux-api, linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team

Last year I sent an RFC about using oom-reaper while killing a
process: https://patchwork.kernel.org/cover/10894999. During the
LSFMM2019 discussion https://lwn.net/Articles/787217 a couple of
alternative options were discussed, the most promising of which
(outlined in the last paragraph of https://lwn.net/Articles/787217)
was to use a remote version of the madvise(MADV_DONTNEED) operation
to force memory reclaim of a killed process. With process_madvise()
making its way through reviews
(https://patchwork.kernel.org/patch/11747133/), I would like to
revive this discussion and get feedback on several possible options
and their pros and cons.

The need is similar to why oom-reaper was introduced - when a process
is being killed to free memory, we want to make sure memory is freed
even if the victim is in uninterruptible sleep or is busy and its
reaction to SIGKILL is delayed by an unpredictable amount of time. I
experimented with enabling the process_madvise(MADV_DONTNEED)
operation and using it to force memory reclaim of the target process
after sending SIGKILL. Unfortunately this approach requires the
caller to read /proc/pid/maps to extract the list of VMAs to pass as
an input to process_madvise(). This is a time-consuming operation. I
measured times similar to what Minchan indicated in
https://lore.kernel.org/linux-mm/20190528032632.GF6879@google.com/ and
the reason reading /proc/pid/maps consumes that much time is the
number of read syscalls required to read this file. The /proc/pid/maps
file, being a seq_file, can be read in chunks of at most 4096 bytes
(1 page); even if userspace provides a bigger buffer, only up to 4096
bytes will be read per syscall. Measured on a Qualcomm® Snapdragon
855™ using its big core at 2.84GHz, a single read syscall takes
between 50 and 200us (when there is no contention on mmap_sem or some
other lock during the syscall). Taking one typical example from my
tests, a 219232-byte /proc/pid/maps file describing 1623 VMAs required
55 read syscalls. With mmap_sem contention, reading /proc/pid/maps can
take even longer. In my tests I measured typical delays of 3-7ms, with
occasional delays of up to 20ms when a read syscall was blocked and
the process got into uninterruptible sleep.
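
To make that syscall count concrete, here is a minimal sketch of the
measurement loop (the file path is just an example); it only
demonstrates that a large userspace buffer does not reduce the number
of read() calls:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[65536];	/* a big buffer does not help */
	long reads = 0, bytes = 0;
	ssize_t n;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);	/* e.g. /proc/1234/maps */
	if (fd < 0)
		return 1;
	/* a seq_file hands back at most one page (4096 bytes) per read() */
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		reads++;
		bytes += n;
	}
	close(fd);
	printf("%ld bytes in %ld read() calls\n", bytes, reads);
	return 0;
}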

While the objective is to guarantee forward progress even when the
victim cannot terminate, we still want this mechanism to be efficient
because we perform these operations to relieve memory pressure before
it affects user experience.

Alternative options I would like your feedback on are:
1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
specifically for this case to indicate that the whole mm can be freed.
2. A new syscall to efficiently obtain a vector of VMAs (start,
length, flags) of the process instead of reading /proc/pid/maps. The
size of the vector is still limited by UIO_MAXIOV (1024), so several
calls might be needed to query a larger number of VMAs; however, it
will still be an order of magnitude more efficient than reading the
/proc/pid/maps file in 4K or smaller chunks.
3. Use process_madvise() flags parameter to indicate a bulk operation
which ignores input vectors. Sample usage: process_madvise(pidfd,
MADV_DONTNEED, vector=NULL, vlen=0, flags=PMADV_FLAG_FILE |
PMADV_FLAG_ANON);
4. madvise()/process_madvise() handle gaps between VMAs, so we could
provide one vector element spanning the entire address space (see the
sketch after this list). There are technical issues with this
approach (the process_madvise() return value can't represent such a
large number of bytes and there is a MAX_RW_COUNT limit on the number
of bytes one process_madvise() call can handle) but I would still
like to hear opinions about it. If this option is preferable maybe we
can deal with these limitations.
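
As a reference for option (4), a minimal sketch, assuming a kernel
with the process_madvise() series applied, headers that define
__NR_process_madvise, and MADV_DONTNEED being allowed for remote use
(none of which is true in mainline today):

#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t reap_mm(int pidfd)
{
	/* one vector element spanning the whole address space */
	struct iovec iov = {
		.iov_base = NULL,
		.iov_len = SIZE_MAX,
	};

	/*
	 * As written this would fail on the first VMA gap and the
	 * return value would be clamped by MAX_RW_COUNT - exactly the
	 * limitations discussed above.
	 */
	return syscall(__NR_process_madvise, pidfd, &iov, 1,
		       MADV_DONTNEED, 0);
}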

We can also go back to reclaiming the victim's memory asynchronously,
but the synchronous method has the following advantages:
- reaping will be performed in the caller's context and therefore
with the caller's priority, CPU affinity and CPU bandwidth; the
reaping workload will be charged to the caller and accounted for.
- reaping is a blocking/synchronous operation for the caller, so when
it's finished, the caller can be sure the mm is freed (or almost
freed, considering lazy freeing and batching mechanisms) and it can
reassess the memory conditions right away.
- for very large MMs (not really my case) the caller could split the
VMA vector and perform reaping from multiple threads to make it
faster. This would not be possible with options (1) and (3).

I would really appreciate your feedback on these options for future development.
Thanks,
Suren.


* Re: [RFC]: userspace memory reaping
  2020-09-15  0:43 [RFC]: userspace memory reaping Suren Baghdasaryan
@ 2020-09-15  0:45 ` Suren Baghdasaryan
  2020-10-14 12:09   ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-09-15  0:45 UTC (permalink / raw)
  To: linux-api, linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML

+ linux-kernel@vger.kernel.org

On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
[...]


* Re: [RFC]: userspace memory reaping
  2020-09-15  0:45 ` Suren Baghdasaryan
@ 2020-10-14 12:09   ` Michal Hocko
  2020-10-14 16:57     ` Suren Baghdasaryan
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-10-14 12:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML

[Sorry for a late reply]

On Mon 14-09-20 17:45:44, Suren Baghdasaryan wrote:
> + linux-kernel@vger.kernel.org
> 
> On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Last year I sent an RFC about using oom-reaper while killing a
> > process: https://patchwork.kernel.org/cover/10894999. During the
> > LSFMM2019 discussion https://lwn.net/Articles/787217 a couple of
> > alternative options were discussed, the most promising of which
> > (outlined in the last paragraph of https://lwn.net/Articles/787217)
> > was to use a remote version of the madvise(MADV_DONTNEED) operation
> > to force memory reclaim of a killed process. With process_madvise()
> > making its way through reviews
> > (https://patchwork.kernel.org/patch/11747133/), I would like to
> > revive this discussion and get feedback on several possible options
> > and their pros and cons.

Thanks for reviving this!

> > The need is similar to why oom-reaper was introduced - when a process
> > is being killed to free memory, we want to make sure memory is freed
> > even if the victim is in uninterruptible sleep or is busy and its
> > reaction to SIGKILL is delayed by an unpredictable amount of time. I
> > experimented with enabling the process_madvise(MADV_DONTNEED)
> > operation and using it to force memory reclaim of the target process
> > after sending SIGKILL. Unfortunately this approach requires the
> > caller to read /proc/pid/maps to extract the list of VMAs to pass as
> > an input to process_madvise().

Well I would argue that this is not really necessary. You can simply
call process_madvise with the full address range and let the kernel
operate only on ranges which are safe to tear down asynchronously.
Sure that would require some changes to the existing code to not fail
on those ranges if they contain incompatible vmas, but that should be
possible. If we are worried about backward compatibility then a
dedicated flag could override that.

[...]

> > While the objective is to guarantee forward progress even when the
> > victim cannot terminate, we still want this mechanism to be efficient
> > because we perform these operations to relieve memory pressure before
> > it affects user experience.
> >
> > Alternative options I would like your feedback on are:
> > 1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
> > specifically for this case to indicate that the whole mm can be freed.

This shouldn't be any different from madvise on the full address range,
right?

> > 2. A new syscall to efficiently obtain a vector of VMAs (start,
> > length, flags) of the process instead of reading /proc/pid/maps. The
> > size of the vector is still limited by UIO_MAXIOV (1024), so several
> > calls might be needed to query a larger number of VMAs; however, it
> > will still be an order of magnitude more efficient than reading the
> > /proc/pid/maps file in 4K or smaller chunks.

While this might be interesting for other usecases - userspace memory
management in general - I do not think it is directly related to this
particular feature.

> > 3. Use process_madvise() flags parameter to indicate a bulk operation
> > which ignores input vectors. Sample usage: process_madvise(pidfd,
> > MADV_DONTNEED, vector=NULL, vlen=0, flags=PMADV_FLAG_FILE |
> > PMADV_FLAG_ANON);

Similar to above.

> > 4. madvise()/process_madvise() handle gaps between VMAs, so we could
> > provide one vector element spanning the entire address space. There
> > are technical issues with this approach (the process_madvise() return
> > value can't represent such a large number of bytes and there is a
> > MAX_RW_COUNT limit on the number of bytes one process_madvise() call
> > can handle) but I would still like to hear opinions about it. If this
> > option is preferable maybe we can deal with these limitations.

To be really honest, the more I am thinking about remote
MADV_DONTNEED the less I like it. Sure we can limit this
functionality to killed tasks, but there is still a need for
MMF_UNSTABLE, which the current oom reaper sets to prevent memory
corruption while the task is still in the kernel. A userspace memory
reaper would need something similar.

I do have a vague recollection that we have discussed a kill(2)-based
approach as well in the past. Essentially a SIG_KILL_SYNC which would
not only send the signal but would also start a teardown of resources
owned by the task - at least those we can remove safely. The
interface would be much simpler and less tricky to use. You just make
your userspace oom killer or potentially other users call
SIG_KILL_SYNC, which will be more expensive, but you would at least
know that as many resources have been freed as the kernel can afford
at the moment.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-10-14 12:09   ` Michal Hocko
@ 2020-10-14 16:57     ` Suren Baghdasaryan
  2020-10-14 18:39       ` minchan
  2020-10-15  9:20       ` Michal Hocko
  0 siblings, 2 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-10-14 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [Sorry for a late reply]
>
> On Mon 14-09-20 17:45:44, Suren Baghdasaryan wrote:
> > + linux-kernel@vger.kernel.org
> >
> > On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > Last year I sent an RFC about using oom-reaper while killing a
> > > process: https://patchwork.kernel.org/cover/10894999. During the
> > > LSFMM2019 discussion https://lwn.net/Articles/787217 a couple of
> > > alternative options were discussed, the most promising of which
> > > (outlined in the last paragraph of https://lwn.net/Articles/787217)
> > > was to use a remote version of the madvise(MADV_DONTNEED) operation
> > > to force memory reclaim of a killed process. With process_madvise()
> > > making its way through reviews
> > > (https://patchwork.kernel.org/patch/11747133/), I would like to
> > > revive this discussion and get feedback on several possible options
> > > and their pros and cons.
>
> Thanks for reviving this!

Thanks for your feedback!

>
> > > The need is similar to why oom-reaper was introduced - when a process
> > > is being killed to free memory, we want to make sure memory is freed
> > > even if the victim is in uninterruptible sleep or is busy and its
> > > reaction to SIGKILL is delayed by an unpredictable amount of time. I
> > > experimented with enabling the process_madvise(MADV_DONTNEED)
> > > operation and using it to force memory reclaim of the target process
> > > after sending SIGKILL. Unfortunately this approach requires the
> > > caller to read /proc/pid/maps to extract the list of VMAs to pass as
> > > an input to process_madvise().
>
> Well I would argue that this is not really necessary. You can simply
> call process_madvise with the full address range and let the kernel
> operate only on ranges which are safe to tear down asynchronously.
> Sure that would require some changes to the existing code to not fail
> on those ranges if they contain incompatible vmas, but that should be
> possible. If we are worried about backward compatibility then a
> dedicated flag could override that.
>

IIUC this is very similar to the last option I proposed. I think this
is doable if we treat it as a special case. The process_madvise()
return value not being able to handle a large range would still be a
problem. Maybe we can return INT_MAX in those cases?

> [...]
>
> > > While the objective is to guarantee forward progress even when the
> > > victim cannot terminate, we still want this mechanism to be efficient
> > > because we perform these operations to relieve memory pressure before
> > > it affects user experience.
> > >
> > > Alternative options I would like your feedback on are:
> > > 1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
> > > specifically for this case to indicate that the whole mm can be freed.
>
> This shouldn't be any different from madvise on the full address range,
> right?
>

Yep, just a matter of choosing the most appropriate API.

> > > 2. A new syscall to efficiently obtain a vector of VMAs (start,
> > > length, flags) of the process instead of reading /proc/pid/maps. The
> > > size of the vector is still limited by UIO_MAXIOV (1024), so several
> > > calls might be needed to query a larger number of VMAs; however, it
> > > will still be an order of magnitude more efficient than reading the
> > > /proc/pid/maps file in 4K or smaller chunks.
>
> While this might be interesting for other usecases - userspace memory
> management in general - I do not think it is directly related to this
> particular feature.
>

True but such a syscall would be useful for other use cases, like
MADV_COLD/MADV_PAGEOUT that Minchan was working on. Maybe we can kill
more than one bird here? Minchan, any thought?

> > > 3. Use process_madvise() flags parameter to indicate a bulk operation
> > > which ignores input vectors. Sample usage: process_madvise(pidfd,
> > > MADV_DONTNEED, vector=NULL, vlen=0, flags=PMADV_FLAG_FILE |
> > > PMADV_FLAG_ANON);
>
> Similar to above.
>

Similar to option 1 I suppose. If so, I agree, just a matter of choosing API.

> > > 4. madvise()/process_madvise() handle gaps between VMAs, so we could
> > > provide one vector element spanning the entire address space. There
> > > are technical issues with this approach (the process_madvise() return
> > > value can't represent such a large number of bytes and there is a
> > > MAX_RW_COUNT limit on the number of bytes one process_madvise() call
> > > can handle) but I would still like to hear opinions about it. If this
> > > option is preferable maybe we can deal with these limitations.
>
> To be really honest, the more I am thinking about remote
> MADV_DONTNEED the less I like it. Sure we can limit this
> functionality to killed tasks, but there is still a need for
> MMF_UNSTABLE, which the current oom reaper sets to prevent memory
> corruption while the task is still in the kernel. A userspace memory
> reaper would need something similar.
>
> I do have a vague recollection that we have discussed a kill(2)-based
> approach as well in the past. Essentially a SIG_KILL_SYNC which would
> not only send the signal but would also start a teardown of resources
> owned by the task - at least those we can remove safely. The
> interface would be much simpler and less tricky to use. You just make
> your userspace oom killer or potentially other users call
> SIG_KILL_SYNC, which will be more expensive, but you would at least
> know that as many resources have been freed as the kernel can afford
> at the moment.

Correct, my early RFC here
https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
was using a new flag for pidfd_send_signal() to request mm reaping by
oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
signal instead of a new pidfd_send_signal() flag and otherwise a very
similar solution. Is my understanding correct?

I remember Mel Gorman (whom I forgot to CC on my original email and
have added now) and Matthew Wilcox were actively participating in
that discussion during LSFMM. I would love to hear their opinions
before jumping into development.

Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-10-14 16:57     ` Suren Baghdasaryan
@ 2020-10-14 18:39       ` minchan
  2020-10-15  9:20       ` Michal Hocko
  1 sibling, 0 replies; 26+ messages in thread
From: minchan @ 2020-10-14 18:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Wed, Oct 14, 2020 at 09:57:20AM -0700, Suren Baghdasaryan wrote:
> On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [Sorry for a late reply]
> >
> > On Mon 14-09-20 17:45:44, Suren Baghdasaryan wrote:
> > > + linux-kernel@vger.kernel.org
> > >
> > > On Mon, Sep 14, 2020 at 5:43 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > Last year I sent an RFC about using oom-reaper while killing a
> > > > process: https://patchwork.kernel.org/cover/10894999. During the
> > > > LSFMM2019 discussion https://lwn.net/Articles/787217 a couple of
> > > > alternative options were discussed, the most promising of which
> > > > (outlined in the last paragraph of https://lwn.net/Articles/787217)
> > > > was to use a remote version of the madvise(MADV_DONTNEED) operation
> > > > to force memory reclaim of a killed process. With process_madvise()
> > > > making its way through reviews
> > > > (https://patchwork.kernel.org/patch/11747133/), I would like to
> > > > revive this discussion and get feedback on several possible options
> > > > and their pros and cons.
> >
> > Thanks for reviving this!
> 
> Thanks for your feedback!
> 
> >
> > > > The need is similar to why oom-reaper was introduced - when a process
> > > > is being killed to free memory, we want to make sure memory is freed
> > > > even if the victim is in uninterruptible sleep or is busy and its
> > > > reaction to SIGKILL is delayed by an unpredictable amount of time. I
> > > > experimented with enabling the process_madvise(MADV_DONTNEED)
> > > > operation and using it to force memory reclaim of the target process
> > > > after sending SIGKILL. Unfortunately this approach requires the
> > > > caller to read /proc/pid/maps to extract the list of VMAs to pass as
> > > > an input to process_madvise().
> >
> > Well I would argue that this is not really necessary. You can simply
> > call process_madvise with the full address range and let the kernel
> > operate only on ranges which are safe to tear down asynchronously.
> > Sure that would require some changes to the existing code to not fail
> > on those ranges if they contain incompatible vmas, but that should be
> > possible. If we are worried about backward compatibility then a
> > dedicated flag could override that.
> >
> 
> IIUC this is very similar to the last option I proposed. I think this
> is doable if we treat it as a special case. The process_madvise()
> return value not being able to handle a large range would still be a
> problem. Maybe we can return INT_MAX in those cases?

Or, maybe we could just return 0 if the operation succeeds without any
error.

> 
> > [...]
> >
> > > > While the objective is to guarantee forward progress even when the
> > > > victim cannot terminate, we still want this mechanism to be efficient
> > > > because we perform these operations to relieve memory pressure before
> > > > it affects user experience.
> > > >
> > > > Alternative options I would like your feedback on are:
> > > > 1. Introduce a dedicated process_madvise(MADV_DONTNEED_MM)
> > > > specifically for this case to indicate that the whole mm can be freed.
> >
> > This shouldn't be any different from madvise on the full address range,
> > right?
> >
> 
> Yep, just a matter of choosing the most appropriate API.


I agree that passing the full range, or just NULL to indicate the
entire address space, would be better than introducing a new advice,
in that we could avoid MADV_PAGEOUT_MM and MADV_COLD_MM.

> 
> > > > 2. A new syscall to efficiently obtain a vector of VMAs (start,
> > > > length, flags) of the process instead of reading /proc/pid/maps. The
> > > > size of the vector is still limited by UIO_MAXIOV (1024), so several
> > > > calls might be needed to query a larger number of VMAs; however, it
> > > > will still be an order of magnitude more efficient than reading the
> > > > /proc/pid/maps file in 4K or smaller chunks.
> >
> > While this might be interesting for other usecases - userspace memory
> > management in general - I do not think it is directly related to this
> > particular feature.
> >
> 
> True but such a syscall would be useful for other use cases, like
> MADV_COLD/MADV_PAGEOUT that Minchan was working on. Maybe we can kill
> more than one bird here? Minchan, any thought?

Generally, it could be helpful, but I don't see a desperate need for
it at this moment.


* Re: [RFC]: userspace memory reaping
  2020-10-14 16:57     ` Suren Baghdasaryan
  2020-10-14 18:39       ` minchan
@ 2020-10-15  9:20       ` Michal Hocko
  2020-10-15 18:43         ` Minchan Kim
  2020-10-15 19:25         ` Suren Baghdasaryan
  1 sibling, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2020-10-15  9:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > The need is similar to why oom-reaper was introduced - when a process
> > > > is being killed to free memory, we want to make sure memory is freed
> > > > even if the victim is in uninterruptible sleep or is busy and its
> > > > reaction to SIGKILL is delayed by an unpredictable amount of time. I
> > > > experimented with enabling the process_madvise(MADV_DONTNEED)
> > > > operation and using it to force memory reclaim of the target process
> > > > after sending SIGKILL. Unfortunately this approach requires the
> > > > caller to read /proc/pid/maps to extract the list of VMAs to pass as
> > > > an input to process_madvise().
> >
> > Well I would argue that this is not really necessary. You can simply
> > call process_madvise with the full address range and let the kernel
> > operate only on ranges which are safe to tear down asynchronously.
> > Sure that would require some changes to the existing code to not fail
> > on those ranges if they contain incompatible vmas, but that should be
> > possible. If we are worried about backward compatibility then a
> > dedicated flag could override that.
> >
> 
> IIUC this is very similar to the last option I proposed. I think this
> is doable if we treat it as a special case. The process_madvise()
> return value not being able to handle a large range would still be a
> problem. Maybe we can return INT_MAX in those cases?

madvise is documented to return 
       On success, madvise() returns zero.  On error, it returns -1 and
       errno is set appropriately.
[...]
NOTES
   Linux notes
       The Linux implementation requires that the address addr be
       page-aligned, and allows length to be zero.  If there are some
       parts of the specified address range that are not mapped, the
       Linux version of madvise() ignores them and applies the call to
       the rest (but returns ENOMEM from the system call, as it should).

I have learned about the ENOMEM case only now. And it seems this is
indeed what we are implementing. So if we want to add a new mode to
opportunistically attempt madvise on the whole given range without a
failure then we need a specific flag for that. Advice is a number
rather than a bitmask, but (ab)using the top bit or the negative
number space (e.g. -MADV_DONTNEED) for that sounds possible, albeit a
bit hackish.
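
Purely to illustrate that encoding (this is not existing kernel code,
and "best effort" is just a placeholder name), the decode side could
look like:

#include <stdbool.h>

/*
 * Hypothetical: a negative advice value requests the same advice in a
 * best-effort mode which skips incompatible ranges instead of failing
 * with ENOMEM.
 */
static int decode_advice(int advice, bool *best_effort)
{
	*best_effort = advice < 0;
	return *best_effort ? -advice : advice;
}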

[...]
> > I do have a vague recollection that we have discussed a kill(2)-based
> > approach as well in the past. Essentially a SIG_KILL_SYNC which would
> > not only send the signal but would also start a teardown of resources
> > owned by the task - at least those we can remove safely. The
> > interface would be much simpler and less tricky to use. You just make
> > your userspace oom killer or potentially other users call
> > SIG_KILL_SYNC, which will be more expensive, but you would at least
> > know that as many resources have been freed as the kernel can afford
> > at the moment.
> 
> Correct, my early RFC here
> https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> was using a new flag for pidfd_send_signal() to request mm reaping by
> oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> signal instead of a new pidfd_send_signal() flag and otherwise a very
> similar solution. Is my understanding correct?

Well, I think you shouldn't focus too much on the oom-reaper aspect
of it. Sure it can be used for that, but I believe that a new signal
should provide sync behavior. People more familiar with process
management would be better suited to define what is possible for a
new sync signal. Ideally not only pro-active process destruction but
also sync waiting until the target process is released, so that you
know that once the kill syscall returns, the process is gone.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-10-15  9:20       ` Michal Hocko
@ 2020-10-15 18:43         ` Minchan Kim
  2020-10-15 19:32           ` Suren Baghdasaryan
  2020-10-15 19:25         ` Suren Baghdasaryan
  1 sibling, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2020-10-15 18:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, linux-api, linux-mm, Andrew Morton,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Thu, Oct 15, 2020 at 11:20:30AM +0200, Michal Hocko wrote:

> > > I do have a vague recollection that we have discussed a kill(2)-based
> > > approach as well in the past. Essentially a SIG_KILL_SYNC which would
> > > not only send the signal but would also start a teardown of resources
> > > owned by the task - at least those we can remove safely. The
> > > interface would be much simpler and less tricky to use. You just make
> > > your userspace oom killer or potentially other users call
> > > SIG_KILL_SYNC, which will be more expensive, but you would at least
> > > know that as many resources have been freed as the kernel can afford
> > > at the moment.
> > 
> > Correct, my early RFC here
> > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> > was using a new flag for pidfd_send_signal() to request mm reaping by
> > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > similar solution. Is my understanding correct?
> 
> Well, I think you shouldn't focus too much on the oom-reaper aspect
> of it. Sure it can be used for that, but I believe that a new signal
> should provide sync behavior. People more familiar with process
> management would be better suited to define what is possible for a
> new sync signal. Ideally not only pro-active process destruction but
> also sync waiting until the target process is released, so that you
> know that once the kill syscall returns, the process is gone.

If we approach this with a signal, I am not sure we need to create a
new signal rather than reusing the pidfd and fsync(2) semantics.

Furthermore, process_madvise() does the work in the caller's context,
but a signal might work in some other context depending on the
implementation (the oom reaper, or the CPU that resumed the task). I
am not sure it fulfils Suren's requirement.

One more thing to think over: even though we spend some overhead to
read /proc/pid/maps, we could do the zapping in parallel in userspace
with a multi-threaded approach. I am not sure what the win is, since
Suren also cares about zapping performance.


* Re: [RFC]: userspace memory reaping
  2020-10-15  9:20       ` Michal Hocko
  2020-10-15 18:43         ` Minchan Kim
@ 2020-10-15 19:25         ` Suren Baghdasaryan
  2020-11-02 20:29           ` Suren Baghdasaryan
  1 sibling, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-10-15 19:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Thu, Oct 15, 2020 at 2:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> > On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > The need is similar to why oom-reaper was introduced - when a process
> > > > > is being killed to free memory, we want to make sure memory is freed
> > > > > even if the victim is in uninterruptible sleep or is busy and its
> > > > > reaction to SIGKILL is delayed by an unpredictable amount of time. I
> > > > > experimented with enabling the process_madvise(MADV_DONTNEED)
> > > > > operation and using it to force memory reclaim of the target process
> > > > > after sending SIGKILL. Unfortunately this approach requires the
> > > > > caller to read /proc/pid/maps to extract the list of VMAs to pass as
> > > > > an input to process_madvise().
> > >
> > > Well I would argue that this is not really necessary. You can simply
> > > call process_madvise with the full address range and let the kernel
> > > operate only on ranges which are safe to tear down asynchronously.
> > > Sure that would require some changes to the existing code to not fail
> > > on those ranges if they contain incompatible vmas, but that should be
> > > possible. If we are worried about backward compatibility then a
> > > dedicated flag could override that.
> > >
> >
> > IIUC this is very similar to the last option I proposed. I think this
> > is doable if we treat it as a special case. The process_madvise()
> > return value not being able to handle a large range would still be a
> > problem. Maybe we can return INT_MAX in those cases?
>
> madvise is documented to return
>        On success, madvise() returns zero.  On error, it returns -1 and
>        errno is set appropriately.
> [...]
> NOTES
>    Linux notes
>        The Linux implementation requires that the address addr be
>        page-aligned, and allows length to be zero.  If there are some
>        parts of the specified address range that are not mapped, the
>        Linux version of madvise() ignores them and applies the call to
>        the rest (but returns ENOMEM from the system call, as it should).
>
> I have learned about the ENOMEM case only now. And it seems this is
> indeed what we are implementing. So if we want to add a new mode to
> opportunistically attempt madvise on the whole given range without a
> failure then we need a specific flag for that. Advice is a number
> rather than a bitmask, but (ab)using the top bit or the negative
> number space (e.g. -MADV_DONTNEED) for that sounds possible, albeit a
> bit hackish.

process_madvise() has an additional flags parameter. Why not have a
separate flag to denote that we want to just skip VMA gaps and
proceed without error? Something like MADVF_SKIP_GAPS?

>
> [...]
> > > I do have a vague recollection that we have discussed a kill(2)-based
> > > approach as well in the past. Essentially a SIG_KILL_SYNC which would
> > > not only send the signal but would also start a teardown of resources
> > > owned by the task - at least those we can remove safely. The
> > > interface would be much simpler and less tricky to use. You just make
> > > your userspace oom killer or potentially other users call
> > > SIG_KILL_SYNC, which will be more expensive, but you would at least
> > > know that as many resources have been freed as the kernel can afford
> > > at the moment.
> >
> > Correct, my early RFC here
> > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> > was using a new flag for pidfd_send_signal() to request mm reaping by
> > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > similar solution. Is my understanding correct?
>
> Well, I think you shouldn't focus too much on the oom-reaper aspect
> of it. Sure it can be used for that, but I believe that a new signal
> should provide sync behavior. People more familiar with process
> management would be better suited to define what is possible for a
> new sync signal. Ideally not only pro-active process destruction but
> also sync waiting until the target process is released, so that you
> know that once the kill syscall returns, the process is gone.

If your suggestion is for SIG_KILL_SYNC to perform the victim's
resource cleanup in the context of the caller while the victim is in
uninterruptible sleep, that would definitely be useful. I assume
there are some resources which can't be reclaimed until the process
itself wakes up and handles the SIGKILL. If so, I hope
kill(SIG_KILL_SYNC) would not have to wait for the victim to wake up
and handle the signal. That would really complicate userspace in
cases when we just want to reclaim whatever we can without the
victim's involvement and continue. For cases when waiting is
required, waitid() with P_PIDFD can be used; a sketch follows below.
Would this semantic work?
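
For illustration, a minimal sketch of that semantic. SIG_KILL_SYNC is
the hypothetical signal under discussion (defined as a placeholder
below); pidfd_send_signal() and waitid(P_PIDFD, ...) exist today, the
latter since Linux 5.4:

#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3		/* waitid() on a pidfd, Linux >= 5.4 */
#endif
/* placeholder for the proposed signal; it does not exist today */
#define SIG_KILL_SYNC SIGKILL

static int kill_sync(int pidfd, int wait_for_exit)
{
	siginfo_t info = { 0 };

	/* proposed: also reaps the victim's mm in the caller's context */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIG_KILL_SYNC,
		    NULL, 0) < 0)
		return -1;

	/* waiting for the full exit stays a separate, optional step */
	return wait_for_exit ? waitid(P_PIDFD, pidfd, &info, WEXITED) : 0;
}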

>
> --
> Michal Hocko
> SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-10-15 18:43         ` Minchan Kim
@ 2020-10-15 19:32           ` Suren Baghdasaryan
  0 siblings, 0 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-10-15 19:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Oct 15, 2020 at 11:43 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Oct 15, 2020 at 11:20:30AM +0200, Michal Hocko wrote:
>
> > > > I do have a vague recollection that we have discussed a kill(2)-based
> > > > approach as well in the past. Essentially a SIG_KILL_SYNC which would
> > > > not only send the signal but would also start a teardown of resources
> > > > owned by the task - at least those we can remove safely. The
> > > > interface would be much simpler and less tricky to use. You just make
> > > > your userspace oom killer or potentially other users call
> > > > SIG_KILL_SYNC, which will be more expensive, but you would at least
> > > > know that as many resources have been freed as the kernel can afford
> > > > at the moment.
> > >
> > > Correct, my early RFC here
> > > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> > > was using a new flag for pidfd_send_signal() to request mm reaping by
> > > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > > similar solution. Is my understanding correct?
> >
> > Well, I think you shouldn't focus too much on the oom-reaper aspect
> > of it. Sure it can be used for that, but I believe that a new signal
> > should provide sync behavior. People more familiar with process
> > management would be better suited to define what is possible for a
> > new sync signal. Ideally not only pro-active process destruction but
> > also sync waiting until the target process is released, so that you
> > know that once the kill syscall returns, the process is gone.
>
> If we approach this with a signal, I am not sure we need to create a
> new signal rather than reusing the pidfd and fsync(2) semantics.
>
> Furthermore, process_madvise() does the work in the caller's context,
> but a signal might work in some other context depending on the
> implementation (the oom reaper, or the CPU that resumed the task). I
> am not sure it fulfils Suren's requirement.
>
> One more thing to think over: even though we spend some overhead to
> read /proc/pid/maps, we could do the zapping in parallel in userspace
> with a multi-threaded approach. I am not sure what the win is, since
> Suren also cares about zapping performance.

Sorry Minchan, I did not see your reply while replying to Michal...
Even if we do the reading/reaping in parallel, we still have to issue
tens of read() syscalls to consume the entire /proc/pid/maps file.
Plus, I'm not sure how much mmap_sem contention such a parallel
operation (reaping taking the write lock and maps reading taking the
read lock) would generate. If we go this route, I think a syscall to
read a vector of VMAs would be way more performant and userspace
usage would be much simpler.
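
For reference, the rough shape such a syscall could take - every name
here is made up; it only illustrates the (start, length, flags)
vector idea from option (2):

#include <stddef.h>

/* hypothetical: none of these names exist in the kernel */
struct vma_info {
	unsigned long start;	/* VMA start address */
	unsigned long len;	/* VMA length in bytes */
	unsigned long flags;	/* VM_READ, VM_WRITE, ... */
};

/*
 * Fill vec with up to vlen VMAs of the target, beginning at the first
 * VMA at or above start_addr, and return the number of entries
 * filled. The caller resumes from the end of the last returned VMA
 * until the call returns 0.
 */
long process_vma_query(int pidfd, unsigned long start_addr,
		       struct vma_info *vec, size_t vlen);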


* Re: [RFC]: userspace memory reaping
  2020-10-15 19:25         ` Suren Baghdasaryan
@ 2020-11-02 20:29           ` Suren Baghdasaryan
  2020-11-03  9:35             ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-02 20:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Thu, Oct 15, 2020 at 12:25 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Oct 15, 2020 at 2:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 14-10-20 09:57:20, Suren Baghdasaryan wrote:
> > > On Wed, Oct 14, 2020 at 5:09 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > > The need is similar to why oom-reaper was introduced - when a process
> > > > > > is being killed to free memory, we want to make sure memory is freed
> > > > > > even if the victim is in uninterruptible sleep or is busy and its
> > > > > > reaction to SIGKILL is delayed by an unpredictable amount of time. I
> > > > > > experimented with enabling the process_madvise(MADV_DONTNEED)
> > > > > > operation and using it to force memory reclaim of the target process
> > > > > > after sending SIGKILL. Unfortunately this approach requires the
> > > > > > caller to read /proc/pid/maps to extract the list of VMAs to pass as
> > > > > > an input to process_madvise().
> > > >
> > > > Well I would argue that this is not really necessary. You can simply
> > > > call process_madvise with the full address range and let the kernel
> > > > operate only on ranges which are safe to tear down asynchronously.
> > > > Sure that would require some changes to the existing code to not fail
> > > > on those ranges if they contain incompatible vmas, but that should be
> > > > possible. If we are worried about backward compatibility then a
> > > > dedicated flag could override that.
> > > >
> > >
> > > IIUC this is very similar to the last option I proposed. I think this
> > > is doable if we treat it as a special case. The process_madvise()
> > > return value not being able to handle a large range would still be a
> > > problem. Maybe we can return INT_MAX in those cases?
> >
> > madvise is documented to return
> >        On success, madvise() returns zero.  On error, it returns -1 and
> >        errno is set appropriately.
> > [...]
> > NOTES
> >    Linux notes
> >        The Linux implementation requires that the address addr be
> >        page-aligned, and allows length to be zero.  If there are some
> >        parts of the specified address range that are not mapped, the
> >        Linux version of madvise() ignores them and applies the call to
> >        the rest (but returns ENOMEM from the system call, as it should).
> >
> > I have learned about the ENOMEM case only now. And it seems this is
> > indeed what we are implementing. So if we want to add a new mode to
> > opportunistically attempt madvise on the whole given range without a
> > failure then we need a specific flag for that. Advice is a number
> > rather than a bitmask, but (ab)using the top bit or the negative
> > number space (e.g. -MADV_DONTNEED) for that sounds possible, albeit a
> > bit hackish.
>
> process_madvise() has an additional flags parameter. Why not have a
> separate flag to denote that we want to just skip VMA gaps and
> proceed without error? Something like MADVF_SKIP_GAPS?
>
> >
> > [...]
> > > > I do have a vague recollection that we have discussed a kill(2)-based
> > > > approach as well in the past. Essentially a SIG_KILL_SYNC which would
> > > > not only send the signal but would also start a teardown of resources
> > > > owned by the task - at least those we can remove safely. The
> > > > interface would be much simpler and less tricky to use. You just make
> > > > your userspace oom killer or potentially other users call
> > > > SIG_KILL_SYNC, which will be more expensive, but you would at least
> > > > know that as many resources have been freed as the kernel can afford
> > > > at the moment.
> > >
> > > Correct, my early RFC here
> > > https://patchwork.kernel.org/project/linux-mm/patch/20190411014353.113252-3-surenb@google.com
> > > was using a new flag for pidfd_send_signal() to request mm reaping by
> > > oom-reaper kthread. IIUC you propose to have a new SIG_KILL_SYNC
> > > signal instead of a new pidfd_send_signal() flag and otherwise a very
> > > similar solution. Is my understanding correct?
> >
> > Well, I think you shouldn't focus too much on the oom-reaper aspect
> > of it. Sure it can be used for that, but I believe that a new signal
> > should provide sync behavior. People more familiar with process
> > management would be better suited to define what is possible for a
> > new sync signal. Ideally not only pro-active process destruction but
> > also sync waiting until the target process is released, so that you
> > know that once the kill syscall returns, the process is gone.
>
> If your suggestion is for SIG_KILL_SYNC to perform the victim's
> resource cleanup in the context of the caller while the victim is in
> uninterruptible sleep, that would definitely be useful. I assume
> there are some resources which can't be reclaimed until the process
> itself wakes up and handles the SIGKILL. If so, I hope
> kill(SIG_KILL_SYNC) would not have to wait for the victim to wake up
> and handle the signal. That would really complicate userspace in
> cases when we just want to reclaim whatever we can without the
> victim's involvement and continue. For cases when waiting is
> required, waitid() with P_PIDFD can be used.
> Would this semantic work?
>

To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
which in addition to sending a kill signal would also reap the
victim's mm in the context of the caller? Maybe having some code will
get the discussion moving forward?

> >
> > --
> > Michal Hocko
> > SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-11-02 20:29           ` Suren Baghdasaryan
@ 2020-11-03  9:35             ` Michal Hocko
  2020-11-03 21:28               ` Suren Baghdasaryan
  2020-11-03 21:32               ` Minchan Kim
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2020-11-03  9:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
[...]
> To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> which in addition to sending a kill signal would also reap the
> victim's mm in the context of the caller? Maybe having some code will
> get the discussion moving forward?

Yeah, having code, even preliminary code, might help here. This
definitely needs a go-ahead from the process management people, as
that area is a land full of surprises...
-- 
Michal Hocko
SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-11-03  9:35             ` Michal Hocko
@ 2020-11-03 21:28               ` Suren Baghdasaryan
  2020-11-03 21:32               ` Minchan Kim
  1 sibling, 0 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-03 21:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Tue, Nov 3, 2020 at 1:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> [...]
> > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > which in addition to sending a kill signal would also reap the
> > victim's mm in the context of the caller? Maybe having some code will
> > get the discussion moving forward?
>
> Yeah, having code, even preliminary code, might help here. This
> definitely needs a go-ahead from the process management people, as
> that area is a land full of surprises...

Hmm. Adding a new SIGKILL_SYNC seems to require quite a bit of
surgery... That would change SIGRTMIN, which is used by glibc, and
would also affect RTSIG_MAX, which I think is also used by userspace.
I'm not sure such a change would be acceptable...

IIUC, Minchan's alternative suggestion was to do mm reaping from
inside fsync() in cases like these:

pidfd_send_signal(pidfd, SIGKILL); // send SIGKILL as usual
fsync(pidfd); // causes mm reaping in the context of the caller

I can prototype this approach quite easily and quickly and post an RFC
to energize the discussion. Any objections?

> --
> Michal Hocko
> SUSE Labs


* Re: [RFC]: userspace memory reaping
  2020-11-03  9:35             ` Michal Hocko
  2020-11-03 21:28               ` Suren Baghdasaryan
@ 2020-11-03 21:32               ` Minchan Kim
  2020-11-03 21:40                 ` Suren Baghdasaryan
  2020-11-04  6:58                 ` Michal Hocko
  1 sibling, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2020-11-03 21:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, linux-api, linux-mm, Andrew Morton,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> [...]
> > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > which in addition to sending a kill signal would also reap the
> > victim's mm in the context of the caller? Maybe having some code will
> > get the discussion moving forward?
> 
> Yeah, having code, even preliminary code, might help here. This
> definitely needs a go-ahead from the process management people, as
> that area is a land full of surprises...

Just to remind: an idea I suggested earlier was to reuse an existing
concept:

    fd = pidfd_open(victim_pid, 0);
    fdatasync(fd);
    close(fd);



* Re: [RFC]: userspace memory reaping
  2020-11-03 21:32               ` Minchan Kim
@ 2020-11-03 21:40                 ` Suren Baghdasaryan
  2020-11-03 21:46                   ` Minchan Kim
  2020-11-04  6:58                 ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-03 21:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Tue, Nov 3, 2020 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > [...]
> > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > which in addition to sending a kill signal would also reap the
> > > victim's mm in the context of the caller? Maybe having some code will
> > > get the discussion moving forward?
> >
> > Yeah, having code, even preliminary code, might help here. This
> > definitely needs a go-ahead from the process management people, as
> > that area is a land full of surprises...
>
> Just to remind: an idea I suggested earlier was to reuse an existing
> concept:
>
>     fd = pidfd_open(victim_pid, 0);
>     fdatasync(fd);
>     close(fd);
>

Yep, I just posted a comment about that. I think, though, that your
above sequence is missing a pidfd_send_signal(fd, SIGKILL) before the
fdatasync(fd)...
I'm not sure if fdatasync(pidfd) or fsync(pidfd) would be more
appropriate for this, but I will use one and we can discuss the
details in the RFC with the code.
Thanks!


* Re: [RFC]: userspace memory reaping
  2020-11-03 21:40                 ` Suren Baghdasaryan
@ 2020-11-03 21:46                   ` Minchan Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2020-11-03 21:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Tue, Nov 03, 2020 at 01:40:41PM -0800, Suren Baghdasaryan wrote:
> On Tue, Nov 3, 2020 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > [...]
> > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > which in addition to sending a kill signal would also reap the
> > > > victim's mm in the context of the caller? Maybe having some code will
> > > > get the discussion moving forward?
> > >
> > > Yeah, having code, even preliminary code, might help here. This
> > > definitely needs a go-ahead from the process management people, as
> > > that area is a land full of surprises...
> >
> > Just to remind: an idea I suggested earlier was to reuse an existing
> > concept:
> >
> >     fd = pidfd_open(victim_pid, 0);
> >     fdatasync(fd);
> >     close(fd);
> >
> 
> Yep, I just posted a comment about that. I think, though, that your
> above sequence is missing a pidfd_send_signal(fd, SIGKILL) before the
> fdatasync(fd)...
> Not sure whether fdatasync(pidfd) or fsync(pidfd) would be more
> appropriate for this, but I will use one of them and we can discuss the
> details in the RFC with the code.

IMO, fdatasync would be the better semantic since fsync involves
metadata (i.e., task_struct, mm_struct and so on). I think what you
need is just the pages attached to the address_space, which sounds
like data, not metadata.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-03 21:32               ` Minchan Kim
  2020-11-03 21:40                 ` Suren Baghdasaryan
@ 2020-11-04  6:58                 ` Michal Hocko
  2020-11-04 20:40                   ` Minchan Kim
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-11-04  6:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Suren Baghdasaryan, linux-api, linux-mm, Andrew Morton,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > [...]
> > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > which in addition to sending a kill signal would also reap the
> > > victim's mm in the context of the caller? Maybe having some code will
> > > get the discussion moving forward?
> > 
> > Yeah, having code, even preliminary, might help here. This definitely
> > needs a go-ahead from process management people as that area is a land
> > full of surprises...
>
> Just to remind you of an idea I suggested to reuse an existing concept:
> 
>     fd = pidfd_open(victim process)
>     fdatasync(fd);
>     close(fd);

I must have missed this proposal. Anyway, are you suggesting fdatasync
to act as a destructive operation?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-04  6:58                 ` Michal Hocko
@ 2020-11-04 20:40                   ` Minchan Kim
  2020-11-05 12:20                     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2020-11-04 20:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, linux-api, linux-mm, Andrew Morton,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > [...]
> > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > which in addition to sending a kill signal would also reap the
> > > > victim's mm in the context of the caller? Maybe having some code will
> > > > get the discussion moving forward?
> > > 
> > > Yeah, having code, even preliminary, might help here. This definitely
> > > needs a go-ahead from process management people as that area is a land
> > > full of surprises...
> >
> > Just to remind you of an idea I suggested to reuse an existing concept:
> > 
> >     fd = pidfd_open(victim process)
> >     fdatasync(fd);
> >     close(fd);
> 
> I must have missed this proposal. Anyway, are you suggesting fdatasync
> to act as a destructive operation?

write(fd) && fdatasync(fd) are already a destructive operation if the
file is shared.

You don't need to treat the reaping as a destructive operation. Rather,
just think of it as committing an asynchronous state: "write the file
into the page cache and commit with fsync" and "kill the process and
commit with fsync".
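
As a plain-file illustration of that analogy (the path is made up for
the example; nothing here is pidfd-specific): overwriting shared file
content and committing it with fdatasync() already destroys the
overwritten data for every other user of the file.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char buf[] = "new content";
            /* Assume /tmp/shared holds data other processes depend on. */
            int fd = open("/tmp/shared", O_WRONLY);

            if (fd < 0)
                    return 1;
            /* Overwrite the beginning of the old content... */
            if (write(fd, buf, strlen(buf)) < 0) {
                    close(fd);
                    return 1;
            }
            fdatasync(fd);  /* commit: the old bytes are gone for good */
            close(fd);
            return 0;
    }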

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-04 20:40                   ` Minchan Kim
@ 2020-11-05 12:20                     ` Michal Hocko
  2020-11-05 16:50                       ` Suren Baghdasaryan
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-11-05 12:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Suren Baghdasaryan, linux-api, linux-mm, Andrew Morton,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Christian Brauner, Oleg Nesterov, Tim Murray,
	kernel-team, LKML, Mel Gorman

On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > which in addition to sending a kill signal would also reap the
> > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > get the discussion moving forward?
> > > > 
> > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > needs a go-ahead from process management people as that area is a land
> > > > full of surprises...
> > >
> > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > 
> > >     fd = pidfd_open(victim process)
> > >     fdatasync(fd);
> > >     close(fd);
> > 
> > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > to act as a destructive operation?
> 
> write(fd) && fdatasync(fd) are already a destructive operation if the
> file is shared.

I am likely missing something because fdatasync will not destroy any
underlying data. It will sync the data.

> You don't need to treat the reaping as a destructive operation. Rather,
> just think of it as committing an asynchronous state: "write the file
> into the page cache and commit with fsync" and "kill the process and
> commit with fsync".

I am sorry but I do not follow. The result of the memory reaping is
data loss. Any private mapping will simply lose its content. The caller
will get EFAULT when trying to access it, but there is no way to
reconstruct the data. This is anything but what I see f{data}sync used
for.
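
A self-contained illustration of the data-loss point, using the local
madvise(MADV_DONTNEED), which is real, current behavior; remote reaping
does the equivalent to the victim's private mappings from the outside:

    #include <assert.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* 4096 is assumed to be the page size here. */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            strcpy(p, "precious data");
            /* Drop the pages backing the mapping. */
            madvise(p, 4096, MADV_DONTNEED);
            /* The mapping is still valid but reads back zero-filled;
               the old content is unrecoverable. */
            assert(p[0] == '\0');
            return 0;
    }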

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 12:20                     ` Michal Hocko
@ 2020-11-05 16:50                       ` Suren Baghdasaryan
  2020-11-05 17:07                         ` Minchan Kim
  2020-11-05 17:16                         ` Michal Hocko
  0 siblings, 2 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-05 16:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > which in addition to sending a kill signal would also reap the
> > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > get the discussion moving forward?
> > > > >
> > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > needs a go-ahead from process management people as that area is a land
> > > > > full of surprises...
> > > >
> > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > >
> > > >     fd = pidfd_open(victim process)
> > > >     fdatasync(fd);
> > > >     close(fd);
> > >
> > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > to act as a destructive operation?
> >
> > write(fd) && fdatasync(fd) are already a destructive operation if the
> > file is shared.
>
> I am likely missing something because fdatasync will not destroy any
> underlying data. It will sync the data.
>
> > You don't need to treat the reaping as a destructive operation. Rather,
> > just think of it as committing an asynchronous state: "write the file
> > into the page cache and commit with fsync" and "kill the process and
> > commit with fsync".
>
> I am sorry but I do not follow. The result of the memory reaping is
> data loss. Any private mapping will simply lose its content. The caller
> will get EFAULT when trying to access it, but there is no way to
> reconstruct the data. This is anything but what I see f{data}sync used
> for.

I think Minchan considers f{data}sync as a "commit" operation. So
write+f{data}sync would mean we write and commit the written data, and
kill+f{data}sync would mean we kill and commit that kill (reclaim the
resources).

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 16:50                       ` Suren Baghdasaryan
@ 2020-11-05 17:07                         ` Minchan Kim
  2020-11-05 17:16                         ` Michal Hocko
  1 sibling, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2020-11-05 17:07 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Nov 05, 2020 at 08:50:58AM -0800, Suren Baghdasaryan wrote:
> On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > [...]
> > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > get the discussion moving forward?
> > > > > >
> > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > full of surprises...
> > > > >
> > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > >
> > > > >     fd = pidfd_open(victim process)
> > > > >     fdatasync(fd);
> > > > >     close(fd);
> > > >
> > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > to act as a destructive operation?
> > >
> > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > file is shared.
> >
> > I am likely missing something because fdatasync will not destroy any
> > underlying data. It will sync the data.
> >
> > > You don't need to treat the reaping as a destructive operation. Rather,
> > > just think of it as committing an asynchronous state: "write the file
> > > into the page cache and commit with fsync" and "kill the process and
> > > commit with fsync".
> >
> > I am sorry but I do not follow. The result of the memory reaping is
> > data loss. Any private mapping will simply lose its content. The caller
> > will get EFAULT when trying to access it, but there is no way to
> > reconstruct the data. This is anything but what I see f{data}sync used
> > for.
> 
> I think Minchan considers f{data}sync as a "commit" operation. So
> write+f{data}sync would mean we write and commit the written data, and
> kill+f{data}sync would mean we kill and commit that kill (reclaim the
> resources).

If people don't like f{data}sync, what about ftruncate? My point is
let's reuse an existing API since we have the pidfd.

What I don't like about SIGKILL_SYNC is that it might introduce several
more SIGXXX_SYNC variants later.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 16:50                       ` Suren Baghdasaryan
  2020-11-05 17:07                         ` Minchan Kim
@ 2020-11-05 17:16                         ` Michal Hocko
  2020-11-05 17:21                           ` Suren Baghdasaryan
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-11-05 17:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Minchan Kim, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu 05-11-20 08:50:58, Suren Baghdasaryan wrote:
> On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > [...]
> > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > get the discussion moving forward?
> > > > > >
> > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > full of surprises...
> > > > >
> > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > >
> > > > >     fd = pidfd_open(victim process)
> > > > >     fdatasync(fd);
> > > > >     close(fd);
> > > >
> > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > to act as a destructive operation?
> > >
> > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > file is shared.
> >
> > I am likely missing something because fdatasync will not destroy any
> > underlying data. It will sync the data.
> >
> > > You don't need to treat the reaping as a destructive operation. Rather,
> > > just think of it as committing an asynchronous state: "write the file
> > > into the page cache and commit with fsync" and "kill the process and
> > > commit with fsync".
> >
> > I am sorry but I do not follow. The result of the memory reaping is
> > data loss. Any private mapping will simply lose its content. The caller
> > will get EFAULT when trying to access it, but there is no way to
> > reconstruct the data. This is anything but what I see f{data}sync used
> > for.
> 
> I think Minchan considers f{data}sync as a "commit" operation.

But there is nothing like a commit in that operation. It is simply a
destroy operation. ftruncate, as Minchan mentions in another reply,
would be a closer fit, but how do you interpret the length argument?
And what about memory regions which cannot be reaped?

I do understand that reusing an existing mechanism is usually
preferable, but the semantics should be reasonable and easy to reason
about.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 17:16                         ` Michal Hocko
@ 2020-11-05 17:21                           ` Suren Baghdasaryan
  2020-11-05 17:41                             ` Minchan Kim
  2020-11-05 17:43                             ` Michal Hocko
  0 siblings, 2 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-05 17:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Nov 5, 2020 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 05-11-20 08:50:58, Suren Baghdasaryan wrote:
> > On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > > [...]
> > > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > > get the discussion moving forward?
> > > > > > >
> > > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > > full of surprises...
> > > > > >
> > > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > > >
> > > > > >     fd = pidfd_open(victim process)
> > > > > >     fdatasync(fd);
> > > > > >     close(fd);
> > > > >
> > > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > > to act as a destructive operation?
> > > >
> > > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > > file is shared.
> > >
> > > I am likely missing something because fdatasync will not destroy any
> > > underlying data. It will sync the data.
> > >
> > > > You don't need to treat the reaping as a destructive operation. Rather,
> > > > just think of it as committing an asynchronous state: "write the file
> > > > into the page cache and commit with fsync" and "kill the process and
> > > > commit with fsync".
> > >
> > > I am sorry but I do not follow. The result of the memory reaping is
> > > data loss. Any private mapping will simply lose its content. The caller
> > > will get EFAULT when trying to access it, but there is no way to
> > > reconstruct the data. This is anything but what I see f{data}sync used
> > > for.
> >
> > I think Minchan considers f{data}sync as a "commit" operation.
>
> But there is nothing like a commit in that operation. It is simply a
> destroy operation. ftruncate, as Minchan mentions in another reply,
> would be a closer fit, but how do you interpret the length argument?
> And what about memory regions which cannot be reaped?
>
> I do understand that reusing an existing mechanism is usually
> preferable, but the semantics should be reasonable and easy to reason
> about.

Maybe then we can consider a flag for pidfd_send_signal() to indicate
that we want a synchronous mm cleanup when SIGKILL is being sent?
Similar to my original RFC but cleanup would happen in the context of
the caller. That seems to me like the simplest and most obvious way of
expressing what we want to accomplish. WDYT?
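
A sketch of what that call could look like. PIDFD_SIGNAL_MM_REAP is a
made-up flag name used purely for illustration; no such flag exists,
and a real kernel would reject it with EINVAL:

    #include <signal.h>
    #include <stddef.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical flag: reap the victim's mm before returning. */
    #define PIDFD_SIGNAL_MM_REAP (1U << 0)

    /* Kill through an already-opened pidfd and reap synchronously. */
    static int kill_sync(int pidfd)
    {
            return syscall(SYS_pidfd_send_signal, pidfd, SIGKILL,
                           NULL, PIDFD_SIGNAL_MM_REAP);
    }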

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 17:21                           ` Suren Baghdasaryan
@ 2020-11-05 17:41                             ` Minchan Kim
  2020-11-05 17:43                             ` Michal Hocko
  1 sibling, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2020-11-05 17:41 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Nov 05, 2020 at 09:21:13AM -0800, Suren Baghdasaryan wrote:
> On Thu, Nov 5, 2020 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 05-11-20 08:50:58, Suren Baghdasaryan wrote:
> > > On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > > > [...]
> > > > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > > > get the discussion moving forward?
> > > > > > > >
> > > > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > > > full of surprises...
> > > > > > >
> > > > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > > > >
> > > > > > >     fd = pidfd_open(victim process)
> > > > > > >     fdatasync(fd);
> > > > > > >     close(fd);
> > > > > >
> > > > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > > > to act as a destructive operation?
> > > > >
> > > > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > > > file is shared.
> > > >
> > > > I am likely missing something because fdatasync will not destroy any
> > > > underlying data. It will sync the data.
> > > >
> > > > > You don't need to treat the reaping as a destructive operation. Rather,
> > > > > just think of it as committing an asynchronous state: "write the file
> > > > > into the page cache and commit with fsync" and "kill the process and
> > > > > commit with fsync".
> > > >
> > > > I am sorry but I do not follow. The result of the memory reaping is
> > > > data loss. Any private mapping will simply lose its content. The caller
> > > > will get EFAULT when trying to access it, but there is no way to
> > > > reconstruct the data. This is anything but what I see f{data}sync used
> > > > for.
> > >
> > > I think Minchan considers f{data}sync as a "commit" operation.
> >
> > But there is nothing like a commit in that operation. It is simply a
> > destroy operation. ftruncate, as Minchan mentions in another reply,
> > would be a closer fit, but how do you interpret the length argument?
> > And what about memory regions which cannot be reaped?
> >
> > I do understand that reusing an existing mechanism is usually
> > preferable, but the semantics should be reasonable and easy to reason
> > about.
> 
> Maybe then we can consider a flag for pidfd_send_signal() to indicate
> that we want a synchronous mm cleanup when SIGKILL is being sent?
> Similar to my original RFC but cleanup would happen in the context of
> the caller. That seems to me like the simplest and most obvious way of
> expressing what we want to accomplish. WDYT?

I think that's better than introducing a specific synchronous kill.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 17:21                           ` Suren Baghdasaryan
  2020-11-05 17:41                             ` Minchan Kim
@ 2020-11-05 17:43                             ` Michal Hocko
  2020-11-05 18:02                               ` Suren Baghdasaryan
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2020-11-05 17:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Minchan Kim, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu 05-11-20 09:21:13, Suren Baghdasaryan wrote:
> On Thu, Nov 5, 2020 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 05-11-20 08:50:58, Suren Baghdasaryan wrote:
> > > On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > > > [...]
> > > > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > > > get the discussion moving forward?
> > > > > > > >
> > > > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > > > full of surprises...
> > > > > > >
> > > > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > > > >
> > > > > > >     fd = pidfd_open(victim process)
> > > > > > >     fdatasync(fd);
> > > > > > >     close(fd);
> > > > > >
> > > > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > > > to act as a destructive operation?
> > > > >
> > > > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > > > file is shared.
> > > >
> > > > I am likely missing something because fdatasync will not destroy any
> > > > underlying data. It will sync the data.
> > > >
> > > > > You don't need to treat the reaping as a destructive operation. Rather,
> > > > > just think of it as committing an asynchronous state: "write the file
> > > > > into the page cache and commit with fsync" and "kill the process and
> > > > > commit with fsync".
> > > >
> > > > I am sorry but I do not follow. The result of the memory reaping is
> > > > data loss. Any private mapping will simply lose its content. The caller
> > > > will get EFAULT when trying to access it, but there is no way to
> > > > reconstruct the data. This is anything but what I see f{data}sync used
> > > > for.
> > >
> > > I think Minchan considers f{data}sync as a "commit" operation.
> >
> > But there is nothing like a commit in that operation. It is simply a
> > destroy operation. ftruncate, as Minchan mentions in another reply,
> > would be a closer fit, but how do you interpret the length argument?
> > And what about memory regions which cannot be reaped?
> >
> > I do understand that reusing an existing mechanism is usually
> > preferable, but the semantics should be reasonable and easy to reason
> > about.
> 
> Maybe then we can consider a flag for pidfd_send_signal() to indicate
> that we want a synchronous mm cleanup when SIGKILL is being sent?
> Similar to my original RFC but cleanup would happen in the context of
> the caller. That seems to me like the simplest and most obvious way of
> expressing what we want to accomplish. WDYT?

Yes, that would make sense. Although it would have to be a
SIGKILL-specific flag IMO. But let's see what process management people
think about that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 17:43                             ` Michal Hocko
@ 2020-11-05 18:02                               ` Suren Baghdasaryan
  2020-11-13 17:37                                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-05 18:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Nov 5, 2020 at 9:44 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 05-11-20 09:21:13, Suren Baghdasaryan wrote:
> > On Thu, Nov 5, 2020 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 05-11-20 08:50:58, Suren Baghdasaryan wrote:
> > > > On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > > > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > > > > [...]
> > > > > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > > > > get the discussion moving forward?
> > > > > > > > >
> > > > > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > > > > full of surprises...
> > > > > > > >
> > > > > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > > > > >
> > > > > > > >     fd = pidfd_open(victim process)
> > > > > > > >     fdatasync(fd);
> > > > > > > >     close(fd);
> > > > > > >
> > > > > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > > > > to act as a destructive operation?
> > > > > >
> > > > > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > > > > file is shared.
> > > > >
> > > > > I am likely missing something because fdatasync will not destroy any
> > > > > underlying data. It will sync the data.
> > > > >
> > > > > > You don't need to treat the reaping as a destructive operation. Rather,
> > > > > > just think of it as committing an asynchronous state: "write the file
> > > > > > into the page cache and commit with fsync" and "kill the process and
> > > > > > commit with fsync".
> > > > >
> > > > > I am sorry but I do not follow. The result of the memory reaping is
> > > > > data loss. Any private mapping will simply lose its content. The caller
> > > > > will get EFAULT when trying to access it, but there is no way to
> > > > > reconstruct the data. This is anything but what I see f{data}sync used
> > > > > for.
> > > >
> > > > I think Minchan considers f{data}sync as a "commit" operation.
> > >
> > > But there is nothing like a commit in that operation. It is simply a
> > > destroy operation. ftruncate, as Minchan mentions in another reply,
> > > would be a closer fit, but how do you interpret the length argument?
> > > And what about memory regions which cannot be reaped?
> > >
> > > I do understand that reusing an existing mechanism is usually
> > > preferable, but the semantics should be reasonable and easy to reason
> > > about.
> >
> > Maybe then we can consider a flag for pidfd_send_signal() to indicate
> > that we want a synchronous mm cleanup when SIGKILL is being sent?
> > Similar to my original RFC but cleanup would happen in the context of
> > the caller. That seems to me like the simplest and most obvious way of
> > expressing what we want to accomplish. WDYT?
>
> Yes, that would make sense. Although it would have to be a
> SIGKILL-specific flag IMO. But let's see what process management people
> think about that.

Very well, I'll brush up the implementation and will post a new RFC. Thanks!

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC]: userspace memory reaping
  2020-11-05 18:02                               ` Suren Baghdasaryan
@ 2020-11-13 17:37                                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-11-13 17:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, linux-api, linux-mm, Andrew Morton, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, kernel-team, LKML,
	Mel Gorman

On Thu, Nov 5, 2020 at 10:02 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Nov 5, 2020 at 9:44 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 05-11-20 09:21:13, Suren Baghdasaryan wrote:
> > > On Thu, Nov 5, 2020 at 9:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 05-11-20 08:50:58, Suren Baghdasaryan wrote:
> > > > > On Thu, Nov 5, 2020 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Wed 04-11-20 12:40:51, Minchan Kim wrote:
> > > > > > > On Wed, Nov 04, 2020 at 07:58:44AM +0100, Michal Hocko wrote:
> > > > > > > > On Tue 03-11-20 13:32:28, Minchan Kim wrote:
> > > > > > > > > On Tue, Nov 03, 2020 at 10:35:50AM +0100, Michal Hocko wrote:
> > > > > > > > > > On Mon 02-11-20 12:29:24, Suren Baghdasaryan wrote:
> > > > > > > > > > [...]
> > > > > > > > > > > To follow up on this. Should I post an RFC implementing SIGKILL_SYNC
> > > > > > > > > > > which in addition to sending a kill signal would also reap the
> > > > > > > > > > > victim's mm in the context of the caller? Maybe having some code will
> > > > > > > > > > > get the discussion moving forward?
> > > > > > > > > >
> > > > > > > > > > Yeah, having code, even preliminary, might help here. This definitely
> > > > > > > > > > needs a go-ahead from process management people as that area is a land
> > > > > > > > > > full of surprises...
> > > > > > > > >
> > > > > > > > > Just to remind you of an idea I suggested to reuse an existing concept:
> > > > > > > > >
> > > > > > > > >     fd = pidfd_open(victim process)
> > > > > > > > >     fdatasync(fd);
> > > > > > > > >     close(fd);
> > > > > > > >
> > > > > > > > I must have missed this proposal. Anyway, are you suggesting fdatasync
> > > > > > > > to act as a destructive operation?
> > > > > > >
> > > > > > > write(fd) && fdatasync(fd) are already a destructive operation if the
> > > > > > > file is shared.
> > > > > >
> > > > > > I am likely missing something because fdatasync will not destroy any
> > > > > > underlying data. It will sync the data.
> > > > > >
> > > > > > > You don't need to treat the reaping as a destructive operation. Rather,
> > > > > > > just think of it as committing an asynchronous state: "write the file
> > > > > > > into the page cache and commit with fsync" and "kill the process and
> > > > > > > commit with fsync".
> > > > > >
> > > > > > I am sorry but I do not follow. The result of the memory reaping is
> > > > > > data loss. Any private mapping will simply lose its content. The caller
> > > > > > will get EFAULT when trying to access it, but there is no way to
> > > > > > reconstruct the data. This is anything but what I see f{data}sync used
> > > > > > for.
> > > > >
> > > > > I think Minchan considers f{data}sync as a "commit" operation.
> > > >
> > > > But there is nothing like a commit in that operation. It is simply a
> > > > destroy operation. ftruncate, as Minchan mentions in another reply,
> > > > would be a closer fit, but how do you interpret the length argument?
> > > > And what about memory regions which cannot be reaped?
> > > >
> > > > I do understand that reusing an existing mechanism is usually
> > > > preferable, but the semantics should be reasonable and easy to reason
> > > > about.
> > >
> > > Maybe then we can consider a flag for pidfd_send_signal() to indicate
> > > that we want a synchronous mm cleanup when SIGKILL is being sent?
> > > Similar to my original RFC but cleanup would happen in the context of
> > > the caller. That seems to me like the simplest and most obvious way of
> > > expressing what we want to accomplish. WDYT?
> >
> > Yes, that would make sense. Although it would have to be a
> > SIGKILL-specific flag IMO. But let's see what process management people
> > think about that.
>
> Very well, I'll brush up the implementation and will post a new RFC. Thanks!
>

Sorry for the delay. The new RFC is posted at
https://lkml.org/lkml/2020/11/13/849

> >
> > --
> > Michal Hocko
> > SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-11-13 17:37 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-15  0:43 [RFC]: userspace memory reaping Suren Baghdasaryan
2020-09-15  0:45 ` Suren Baghdasaryan
2020-10-14 12:09   ` Michal Hocko
2020-10-14 16:57     ` Suren Baghdasaryan
2020-10-14 18:39       ` minchan
2020-10-15  9:20       ` Michal Hocko
2020-10-15 18:43         ` Minchan Kim
2020-10-15 19:32           ` Suren Baghdasaryan
2020-10-15 19:25         ` Suren Baghdasaryan
2020-11-02 20:29           ` Suren Baghdasaryan
2020-11-03  9:35             ` Michal Hocko
2020-11-03 21:28               ` Suren Baghdasaryan
2020-11-03 21:32               ` Minchan Kim
2020-11-03 21:40                 ` Suren Baghdasaryan
2020-11-03 21:46                   ` Minchan Kim
2020-11-04  6:58                 ` Michal Hocko
2020-11-04 20:40                   ` Minchan Kim
2020-11-05 12:20                     ` Michal Hocko
2020-11-05 16:50                       ` Suren Baghdasaryan
2020-11-05 17:07                         ` Minchan Kim
2020-11-05 17:16                         ` Michal Hocko
2020-11-05 17:21                           ` Suren Baghdasaryan
2020-11-05 17:41                             ` Minchan Kim
2020-11-05 17:43                             ` Michal Hocko
2020-11-05 18:02                               ` Suren Baghdasaryan
2020-11-13 17:37                                 ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).