Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing

From: Suren Baghdasaryan <surenb@google.com>
To: Daniel Colascione <dancol@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	yuzhoujian@didichuxing.com,
	Souptick Joarder <jrdr.linux@gmail.com>,
	Roman Gushchin <guro@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Shakeel Butt <shakeelb@google.com>,
	Christian Brauner <christian@brauner.io>,
	Minchan Kim <minchan@kernel.org>,
	Tim Murray <timmurray@google.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Jann Horn <jannh@google.com>, linux-mm <linux-mm@kvack.org>,
	lsf-pc@lists.linux-foundation.org,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Android Kernel Team <kernel-team@android.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
Date: Thu, 25 Apr 2019 09:09:46 -0700	[thread overview]
Message-ID: <CAJuCfpHQm5=QrOiOQ4thA5FmRUQjNGdSfis=iEaNGv7npmMVxQ@mail.gmail.com> (raw)
In-Reply-To: <CAKOZuetQH1rVtPdMNgw0sdnzWidd6v9eCWscRiOb7Y+3-JQ14Q@mail.gmail.com>

On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.
> > >
> > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > every time a process is going to die?
> >
> > Well, you are tearing down an address space which might be still in use
> > because the task not fully dead yeat. So there are two downsides AFAICS.
> > Core dumping which will not see the reaped memory so the resulting
>
> Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you
> try to start a core dump after reaping begins, too bad: you could have
> raced with process death anyway.
>
> > coredump might be incomplete. And unexpected #PF/gup on the reaped
> > memory will result in SIGBUS.
>
> It's a dying process. Why even bother returning from the fault
> handler? Just treat that situation as a thread exit. There's no need
> to make this observable to userspace at all.

I've spent some more time to investigate possible effects of reaping
on coredumps and asked Oleg Nesterov who worked on patchsets that
prioritize SIGKILLs over coredump activity
(https://lkml.org/lkml/2013/2/17/118). Current do_coredump
implementation seems to handle the case of SIGKILL interruption by
bailing out whenever dump_interrupted() returns true and that would be
the case with pending SIGKILL. So in the case of race when coredump
happens first and SIGKILL comes next interrupting the coredump seems
to result in no change in behavior and reaping memory proactively
seems to have no side effects.
An opposite race when SIGKILL gets posted and then coredump happens
seems impossible because do_coredump won't be called from get_signal
due to signal_group_exit check (get_signal checks signal_group_exit
while holding sighand->siglock and complete_signal sets
SIGNAL_GROUP_EXIT while holding the same lock). There is a path from
__seccomp_filter calling do_coredump while processing
SECCOMP_RET_KILL_xxx but even then it should bail out when
coredump_wait()->zap_threads(current) checks signal_group_exit().
(Thanks Oleg for clarifying this for me).

If we are really concerned about possible increase in failed coredumps
because of the proactive reaping I could make it conditional on
whether coredumping mm is possible by placing this feature behind
!get_dumpable(mm) condition. Another possibility is to check
RLIMIT_CORE to decide if coredumps are possible (although if pipe is
used for coredump that limit seems to be ignored, so that check would
have to take this into consideration).

On the issue of SIGBUS happening when accessed memory was already
reaped, my understanding that SIGBUS being a synchronous signal will
still have to be fetched using dequeue_synchronous_signal from
get_signal but not before signal_group_exit is checked. So again if
SIGKILL is pending I think SIGBUS will be ignored (please correct me
if that's not correct).

One additional question I would like to clarify is whether per-node
reapers like Roman suggested would make a big difference (All CPUs
that I've seen used for Android are single-node ones, so looking for
more feedback here). If it's important then reaping victim's memory by
the killer is probably not an option.