* [RFC] Add critical process prctl
@ 2019-09-05  0:53 Daniel Colascione
  2019-09-10 16:56 ` Andy Lutomirski
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel Colascione @ 2019-09-05  0:53 UTC (permalink / raw)
  To: dancol, timmurray, surenb, linux-kernel, linux-api

A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
meaning that if the task ever exits, the kernel panics. This facility
is intended for use by low-level core system processes that cannot
gracefully restart without a reboot. This prctl allows these processes
to ensure that the system restarts when they die regardless of whether
the rest of userspace is operational.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 include/linux/sched.h      |  5 +++++
 include/uapi/linux/prctl.h |  5 +++++
 kernel/exit.c              |  2 ++
 kernel/sys.c               | 19 +++++++++++++++++++
 4 files changed, 31 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..29420b9ebb63 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,7 @@ static inline bool is_percpu_thread(void)
 #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
 #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
 #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
+#define PFA_CRITICAL			8	/* Panic system if process exits */
 
 #define TASK_PFA_TEST(name, func)					\
 	static inline bool task_##func(struct task_struct *p)		\
@@ -1568,6 +1569,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
 TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
 TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
 
+TASK_PFA_TEST(CRITICAL, critical)
+TASK_PFA_SET(CRITICAL, critical)
+TASK_PFA_CLEAR(CRITICAL, critical)
+
 static inline void
 current_restore_flags(unsigned long orig_flags, unsigned long flags)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..4964723bbd47 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,9 @@ struct prctl_mm_map {
 # define PR_PAC_APDBKEY			(1UL << 3)
 # define PR_PAC_APGAKEY			(1UL << 4)
 
+/* Per-task criticality control */
+#define PR_SET_TASK_CRITICAL 55
+#define PR_CRITICAL_NOT_CRITICAL 0
+#define PR_CRITICAL_CRITICAL 1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..9b3d3411d935 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -788,6 +788,8 @@ void __noreturn do_exit(long code)
 		panic("Aiee, killing interrupt handler!");
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
+	if (unlikely(task_critical(tsk)))
+		panic("Critical task died!");
 
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
diff --git a/kernel/sys.c b/kernel/sys.c
index 2969304c29fe..097e05ebaf94 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2269,6 +2269,20 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
+int task_do_set_critical(struct task_struct *t, unsigned long opt)
+{
+	if (opt != PR_CRITICAL_NOT_CRITICAL &&
+	    opt != PR_CRITICAL_CRITICAL)
+		return -EINVAL;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (opt == PR_CRITICAL_NOT_CRITICAL)
+		task_clear_critical(t);
+	else
+		task_set_critical(t);
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2492,6 +2506,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = PAC_RESET_KEYS(me, arg2);
 		break;
+	case PR_SET_TASK_CRITICAL:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_do_set_critical(me, arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.23.0.187.g17f5b7556c-goog
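
For readers following the argument handling in the hunks above, here is a
small userspace model of it. It is only a sketch: struct model_task and the
model_* functions are hypothetical names invented for illustration, and
nothing here calls the real prctl(). The value 55 is taken from this RFC and
should not be issued against a released kernel, where, as far as I know, that
number is assigned to a different prctl option.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Values copied from the RFC's uapi hunk; not present in released headers. */
#define PR_SET_TASK_CRITICAL     55
#define PR_CRITICAL_NOT_CRITICAL 0
#define PR_CRITICAL_CRITICAL     1

/* Hypothetical stand-in for the task state the patch touches. */
struct model_task {
	bool has_cap_sys_admin;	/* models capable(CAP_SYS_ADMIN) */
	bool critical;		/* models the PFA_CRITICAL flag */
};

/* Mirrors task_do_set_critical(): validate opt, check capability, update flag. */
static int model_set_critical(struct model_task *t, unsigned long opt)
{
	assert(t != NULL);
	if (opt != PR_CRITICAL_NOT_CRITICAL && opt != PR_CRITICAL_CRITICAL)
		return -EINVAL;
	if (!t->has_cap_sys_admin)
		return -EPERM;
	t->critical = (opt == PR_CRITICAL_CRITICAL);
	return 0;
}

/* Mirrors the PR_SET_TASK_CRITICAL case added to the prctl() switch. */
static int model_prctl(struct model_task *t, int option, unsigned long arg2,
		       unsigned long arg3, unsigned long arg4,
		       unsigned long arg5)
{
	if (option != PR_SET_TASK_CRITICAL)
		return -EINVAL;
	if (arg3 || arg4 || arg5)
		return -EINVAL;
	return model_set_critical(t, arg2);
}

/* Mirrors the do_exit() hunk: true where the kernel would panic. */
static bool model_exit_would_panic(const struct model_task *t)
{
	return t->critical;
}
```

Two details the model makes visible: the opt value is validated before the
capability check, so an unprivileged caller passing a bad value sees EINVAL
rather than EPERM, and clearing the flag needs CAP_SYS_ADMIN just as setting
it does.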



* Re: [RFC] Add critical process prctl
  2019-09-05  0:53 [RFC] Add critical process prctl Daniel Colascione
@ 2019-09-10 16:56 ` Andy Lutomirski
  2019-09-10 17:42   ` Daniel Colascione
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Lutomirski @ 2019-09-10 16:56 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: Tim Murray, Suren Baghdasaryan, LKML, Linux API

On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
>
> A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> meaning that if the task ever exits, the kernel panics. This facility
> is intended for use by low-level core system processes that cannot
> gracefully restart without a reboot. This prctl allows these processes
> to ensure that the system restarts when they die regardless of whether
> the rest of userspace is operational.

The kind of panic produced by init crashing is awful -- logs don't get
written, etc.  I'm wondering if you would be better off with a new
watchdog-like device that, when closed, kills the system in a
configurable way (e.g. after a certain amount of time, while still
logging something and having a decent chance of getting the logs
written out.)  This could plausibly even be an extension to the
existing /dev/watchdog API.

--Andy


* Re: [RFC] Add critical process prctl
  2019-09-10 16:56 ` Andy Lutomirski
@ 2019-09-10 17:42   ` Daniel Colascione
  2019-09-10 18:15     ` Andy Lutomirski
  0 siblings, 1 reply; 4+ messages in thread
From: Daniel Colascione @ 2019-09-10 17:42 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Tim Murray, Suren Baghdasaryan, LKML, Linux API

On Tue, Sep 10, 2019 at 9:57 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> > meaning that if the task ever exits, the kernel panics. This facility
> > is intended for use by low-level core system processes that cannot
> > gracefully restart without a reboot. This prctl allows these processes
> > to ensure that the system restarts when they die regardless of whether
> > the rest of userspace is operational.
>
> The kind of panic produced by init crashing is awful -- logs don't get
> written, etc.

True today --- but that's a separate problem, and one that can be
solved in a few ways, e.g., pre-registering log buffers to be
incorporated into any kexec kernel memory dumps. If a system aiming
for reliability can't diagnose panics, that's a problem with or
without my patch.

> I'm wondering if you would be better off with a new
> watchdog-like device that, when closed, kills the system in a
> configurable way (e.g. after a certain amount of time, while still
> logging something and having a decent chance of getting the logs
> written out.)  This could plausibly even be an extension to the
> existing /dev/watchdog API.

There are lots of approaches that work today: a few people have
suggested just having init watch processes, perhaps with pidfds. What
I worry about is increasing the length (both in terms of time and
complexity) of the critical path between something going wrong in a
critical process and the system getting back into a known-good state.
A panic at the earliest moment we know that a marked-critical process
has become doomed seems like the most reliable approach, especially
since alternatives can get backed up behind things like file
descriptor closing and various forms of scheduling delay.
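
The "init watches processes with pidfds" alternative mentioned above could
look roughly like the sketch below. It assumes Linux 5.3 or later for
pidfd_open(2); watch_for_exit() is a hypothetical helper name, and what the
watcher does once the process is gone (reboot, panic, log) is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* x86-64 and most other architectures */
#endif

/*
 * Block until `pid` exits or `timeout_ms` elapses (-1 waits forever).
 * Returns 1 if the process exited, 0 on timeout, -1 on error (errno set).
 */
static int watch_for_exit(pid_t pid, int timeout_ms)
{
	struct pollfd pfd;
	int fd, ret, saved;

	assert(timeout_ms >= -1);

	fd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (fd < 0)
		return -1;

	pfd.fd = fd;
	pfd.events = POLLIN;	/* a pidfd polls readable once the process exits */
	pfd.revents = 0;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	saved = errno;		/* keep poll()'s errno across close() */
	close(fd);
	errno = saved;

	if (ret < 0)
		return -1;
	return ret > 0 ? 1 : 0;
}
```

Note that the watcher only learns of the exit after it has been scheduled and
poll() has returned, which is the extra latency on the critical path that the
paragraph above is worried about.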


* Re: [RFC] Add critical process prctl
  2019-09-10 17:42   ` Daniel Colascione
@ 2019-09-10 18:15     ` Andy Lutomirski
  0 siblings, 0 replies; 4+ messages in thread
From: Andy Lutomirski @ 2019-09-10 18:15 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Andy Lutomirski, Tim Murray, Suren Baghdasaryan, LKML, Linux API

On Tue, Sep 10, 2019 at 10:43 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Tue, Sep 10, 2019 at 9:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Sep 4, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> > >
> > > A task with CAP_SYS_ADMIN can mark itself PR_SET_TASK_CRITICAL,
> > > meaning that if the task ever exits, the kernel panics. This facility
> > > is intended for use by low-level core system processes that cannot
> > > gracefully restart without a reboot. This prctl allows these processes
> > > to ensure that the system restarts when they die regardless of whether
> > > the rest of userspace is operational.
> >
> > The kind of panic produced by init crashing is awful -- logs don't get
> > written, etc.
>
> True today --- but that's a separate problem, and one that can be
> solved in a few ways, e.g., pre-registering log buffers to be
> incorporated into any kexec kernel memory dumps. If a system aiming
> for reliability can't diagnose panics, that's a problem with or
> without my patch.

It's been a problem for years and years and no one has convincingly
fixed it.  But the particular type of failure you're handling is
unlike most panics: no locks are held, nothing is corrupt, and the
kernel is generally functional.

>
> > I'm wondering if you would be better off with a new
> > watchdog-like device that, when closed, kills the system in a
> > configurable way (e.g. after a certain amount of time, while still
> > logging something and having a decent chance of getting the logs
> > written out.)  This could plausibly even be an extension to the
> > existing /dev/watchdog API.
>
> There are lots of approaches that work today: a few people have
> suggested just having init watch processes, perhaps with pidfds. What
> I worry about is increasing the length (both in terms of time and
> complexity) of the critical path between something going wrong in a
> critical process and the system getting back into a known-good state.
> A panic at the earliest moment we know that a marked-critical process
> has become doomed seems like the most reliable approach, especially
> since alternatives can get backed up behind things like file
> descriptor closing and various forms of scheduling delay.

I think this all depends on exactly what types of failures you care
about.  If the kernel is dead (actually crashed, deadlocked, or merely
livelocked or otherwise failing to make progress) then you have no
particular guarantee that your critical task will make it to do_exit()
in the first place.  Otherwise, I see no real reason that you should
panic immediately in do_exit() rather than waiting the tiny amount of
time it would take for a watchdog driver to notice that the descriptor
was closed.

So, if I were designing this, I think I would want to use a watchdog.
Program it to die immediately if the descriptor is closed and also
program it to die if the descriptor isn't pinged periodically.  The
latter catches the case where the system is failing to make progress.
And "die" can mean "notify a logging daemon and give it five seconds
to do its thing and declare it's done; panic or reboot after five
seconds if it doesn't declare that it's done."

--Andy
