linux-m68k.vger.kernel.org archive mirror
* Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
@ 2021-06-10 20:57 Eric W. Biederman
  2021-06-10 22:04 ` Linus Torvalds
  2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz
  0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-10 20:57 UTC (permalink / raw)
  To: linux-arch
  Cc: Jens Axboe, Oleg Nesterov, Al Viro, Linus Torvalds, linux-kernel,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, linux-alpha,
	Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan,
	Tejun Heo, Daniel Jacobowitz, Kees Cook


Folks,

Digging through the guts of exit I found something I am not quite
certain what to do with.  On some architectures such as alpha, m68k, and
nios2 the kernel calls into system calls with a subset of the registers
saved on the kernel stack, and the kernel calls into signal handling and
a few other contexts with all of the registers saved on the kernel
stack.  The problem is sometimes we read all of the registers from
a context where they are not all saved.

When this was initially observed it looked just like a coredump problem
and it could be solved by tweaking the coredump code.  That change was
77f6ab8b7768 ("don't dump the threads that had been already exiting when
zapped.")

However I have looked further and we have the location where get_signal
is called from io_uring, and we have the ptrace_stop in
PTRACE_EVENT_EXIT.  In PTRACE_EVENT_EXIT we could be called from exit(2)
which is a syscall and we definitely won't have everything saved on the
kernel stack.  I have not double-checked create_io_thread but I don't
think create_io_thread saves all of the registers on the kernel stack.

I think at this point we need to say that the architectures that do
this need to be fixed to at least call do_exit and the kernel
function in create_io_thread with the deeper stack.

Is that reasonable of me to ask?  Is there some other way to deal with
this issue that I am not seeing?  Am I missing some critical detail that
makes PTRACE_EVENT_EXIT in do_exit not a problem if someone reads the
register with ptrace?

Eric


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-10 20:57 Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Eric W. Biederman
@ 2021-06-10 22:04 ` Linus Torvalds
  2021-06-11 21:39   ` Eric W. Biederman
  2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-10 22:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz,
	Kees Cook

On Thu, Jun 10, 2021 at 1:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The problem is sometimes we read all of the registers from
> a context where they are not all saved.

Ouch. Yes. And this is really painful because none of the *normal*
architectures do this, so it gets absolutely no coverage.

> I think at this point we need to say that the architectures that do
> this need to be fixed to at least call do_exit and the kernel
> function in create_io_thread with the deeper stack.

Yeah. We traditionally have that requirement for fork() and friends
too (vfork/clone), so adding exit and io_uring to do so seems like the
most straightforward thing.

But I really wish we had some way to test and trigger this so that we
wouldn't get caught on this before. Something in task_pt_regs() that
catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
affected architectures?

               Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-10 22:04 ` Linus Torvalds
@ 2021-06-11 21:39   ` Eric W. Biederman
  2021-06-11 23:26     ` Linus Torvalds
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-11 21:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz,
	Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 10, 2021 at 1:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The problem is sometimes we read all of the registers from
>> a context where they are not all saved.
>
> Ouch. Yes. And this is really painful because none of the *normal*
> architectures do this, so it gets absolutely no coverage.
>
>> I think at this point we need to say that the architectures that do
>> this need to be fixed to at least call do_exit and the kernel
>> function in create_io_thread with the deeper stack.
>
> Yeah. We traditionally have that requirement for fork() and friends
> too (vfork/clone), so adding exit and io_uring to do so seems like the
> most straightforward thing.

Interesting.  I am starting with Al's analysis and reading the code
to see if I can understand what is going on.  So I am still glossing
over a few details as I dig into this.  Kernel threads not having
all of their registers saved is one of those details.

Looking at copy_thread it looks like at least on alpha we are dealing
with a structure that defines all of the registers in copy_thread.  So
perhaps all of the registers are there in kernel_threads already.  I
don't read alpha assembly very well and fork is a bit subtle.  I don't
know which piece of code is calling
ret_from_fork/ret_from_kernel_thread.

I really suspect that all of those registers are popped so at least for
IO_THREADS we need to push them again, in a way that signal_pt_regs()
can find them.

It looks like we just need something like this to cover the userspace
side of exit.

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..ab0dcb545bd1 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -812,6 +812,22 @@ fork_like fork
 fork_like vfork
 fork_like clone
 
+.macro exit_like name
+	.align	4
+	.globl	alpha_\name
+	.ent	alpha_\name
+alpha_\name:
+	.prologue 0
+	DO_SWITCH_STACK
+	jsr	$26, sys_\name
+	UNDO_SWITCH_STACK
+	ret
+.end	alpha_\name
+.endm
+
+exit_like exit
+exit_like exit_group
+
 .macro	sigreturn_like name
 	.align	4
 	.globl	sys_\name
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..b9d6449d6caa 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	osf_syscall			alpha_syscall_zero
-1	common	exit				sys_exit
+1	common	exit				alpha_exit
 2	common	fork				alpha_fork
 3	common	read				sys_read
 4	common	write				sys_write
@@ -333,7 +333,7 @@
 400	common	io_getevents			sys_io_getevents
 401	common	io_submit			sys_io_submit
 402	common	io_cancel			sys_io_cancel
-405	common	exit_group			sys_exit_group
+405	common	exit_group			alpha_exit_group
 406	common	lookup_dcookie			sys_lookup_dcookie
 407	common	epoll_create			sys_epoll_create
 408	common	epoll_ctl			sys_epoll_ctl


> But I really wish we had some way to test and trigger this so that we
> wouldn't get caught on this before. Something in task_pt_regs() that
> catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
> affected architectures?

I think that would require pushing an extra magic value in SWITCH_STACK
and not just popping it but deliberately changing that value in
UNDO_SWITCH_STACK.  Basically stack canaries.

I don't see how we could do it in an arch independent way though.
Which means it will require auditing all of the architectures to get
there. Volunteers?
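
The canary idea above can be sketched in user space as follows.  This is
only a model, not kernel code: struct switch_stack_model, the two magic
constants and the model_* helpers are all hypothetical names, and a real
version would have to live in each architecture's SWITCH_STACK and
UNDO_SWITCH_STACK assembly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values; any two distinct constants would do. */
#define SWITCH_STACK_MAGIC  0x5357535441434bULL
#define SWITCH_STACK_POISON 0xdeadbeefdeadbeefULL

/* User-space stand-in for the extra callee-saved register frame. */
struct switch_stack_model {
	uint64_t regs[8];	/* placeholder for the callee-saved registers */
	uint64_t magic;		/* canary: armed on save, poisoned on undo */
};

/* Models SWITCH_STACK: save the registers and arm the canary. */
static void model_do_switch_stack(struct switch_stack_model *sw,
				  const uint64_t regs[8])
{
	memcpy(sw->regs, regs, sizeof(sw->regs));
	sw->magic = SWITCH_STACK_MAGIC;
}

/* Models UNDO_SWITCH_STACK: the frame is dead, so poison the canary. */
static void model_undo_switch_stack(struct switch_stack_model *sw)
{
	sw->magic = SWITCH_STACK_POISON;
}

/*
 * A reader such as ptrace would check this before trusting sw->regs.
 * In the kernel the failing case would be a WARN_ON_ONCE().
 */
static bool model_switch_stack_valid(const struct switch_stack_model *sw)
{
	return sw->magic == SWITCH_STACK_MAGIC;
}
```

Poisoning on undo, rather than just popping, is what lets a later reader
tell a stale frame apart from a live one.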

This is looking straightforward enough that I can probably pull
something together, just don't count on me to have it done in anything
resembling a timely manner.

Eric

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-11 21:39   ` Eric W. Biederman
@ 2021-06-11 23:26     ` Linus Torvalds
  2021-06-13 21:54       ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-11 23:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz,
	Kees Cook

On Fri, Jun 11, 2021 at 2:40 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Looking at copy_thread it looks like at least on alpha we are dealing
> with a structure that defines all of the registers in copy_thread.

On the target side, yes.

On the _source_ side, the code does

        struct pt_regs *regs = current_pt_regs();

and that's the part that means that fork() and related functions need
to have done that DO_SWITCH_STACK(), so that they have the full
register set to be copied.

Otherwise it would copy random contents from the source stack.

But that

        if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {

ends up protecting us, and the code never uses that set of source
registers for the io worker threads.

So io_uring looks fine on alpha. I didn't check m68k and friends, but
I think they have the same thing going.

> It looks like we just need something like this to cover the userspace
> side of exit.

Looks correct to me. Except I think you could just use "fork_like()"
instead of creating a new (and identical) "exit_like()" macro.

> > But I really wish we had some way to test and trigger this so that we
> > wouldn't get caught on this before. Something in task_pt_regs() that
> > catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
> > affected architectures?
>
> I think that would require pushing an extra magic value in SWITCH_STACK
> and not just popping it but deliberately changing that value in
> UNDO_SWITCH_STACK.  Basically stack canaries.
>
> I don't see how we could do it in an arch independent way though.

No, I think you're right. There's no obvious generic solution to it,
and once we look at arch-specific ones we're back to "just alpha,
m68k and nios need this or care" and once you're there you might as
well just fix it.

ia64 has some "fast system call" model with limited registers too, but
I think that's limited to just a few very special system calls (ie it
does the reverse of what alpha does: alpha does the fast case by
default, and then marks fork/vfork/clone as special).

             Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-10 20:57 Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Eric W. Biederman
  2021-06-10 22:04 ` Linus Torvalds
@ 2021-06-12 23:38 ` Michael Schmitz
  2021-06-13 19:59   ` Linus Torvalds
  2021-06-14  7:13   ` Michael Schmitz
  1 sibling, 2 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-12 23:38 UTC (permalink / raw)
  To: geert, linux-arch, linux-m68k; +Cc: ebiederm, torvalds, schwab, Michael Schmitz

do_exit() calls ptrace_stop() which may require access to all saved
registers. We only save those registers not preserved by C code
currently.

Provide a special syscall entry for exit and exit_group syscalls
similar to that used by clone and clone3, which have the same
requirements.

No fix to io_uring appears to be needed, because m68k copy_thread
treats kernel threads the same as e.g. alpha does, and copies only
a subset of registers in that case.

CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
---
 arch/m68k/kernel/entry.S              | 14 ++++++++++++++
 arch/m68k/kernel/process.c            | 16 ++++++++++++++++
 arch/m68k/kernel/syscalls/syscall.tbl |  4 ++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/m68k/kernel/entry.S b/arch/m68k/kernel/entry.S
index 9dd76fb..1e067e6 100644
--- a/arch/m68k/kernel/entry.S
+++ b/arch/m68k/kernel/entry.S
@@ -76,6 +76,20 @@ ENTRY(__sys_clone3)
 	lea	%sp@(28),%sp
 	rts
 
+ENTRY(__sys_exit)
+	SAVE_SWITCH_STACK
+	pea	%sp@(SWITCH_STACK_SIZE)
+	jbsr	m68k_exit
+	lea	%sp@(28),%sp
+	rts
+
+ENTRY(__sys_exit_group)
+	SAVE_SWITCH_STACK
+	pea	%sp@(SWITCH_STACK_SIZE)
+	jbsr	m68k_exit_group
+	lea	%sp@(28),%sp
+	rts
+
 ENTRY(sys_sigreturn)
 	SAVE_SWITCH_STACK
 	movel	%sp,%sp@-		  | switch_stack pointer
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index da83cc8..df4e5f1 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -138,6 +138,22 @@ asmlinkage int m68k_clone3(struct pt_regs *regs)
 	return sys_clone3((struct clone_args __user *)regs->d1, regs->d2);
 }
 
+/*
+ * Because extra registers are saved on the stack after the sys_exit()
+ * arguments, this C wrapper extracts them from pt_regs * and then calls the
+ * generic sys_exit() implementation.
+ */
+asmlinkage int m68k_exit(struct pt_regs *regs)
+{
+	return sys_exit(regs->d1);
+}
+
+/* Same for sys_exit_group ... */
+asmlinkage int m68k_exit_group(struct pt_regs *regs)
+{
+	return sys_exit_group(regs->d1);
+}
+
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 		struct task_struct *p, unsigned long tls)
 {
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 0dd019d..3d5b6fbc 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	restart_syscall			sys_restart_syscall
-1	common	exit				sys_exit
+1	common	exit				__sys_exit
 2	common	fork				__sys_fork
 3	common	read				sys_read
 4	common	write				sys_write
@@ -254,7 +254,7 @@
 244	common	io_submit			sys_io_submit
 245	common	io_cancel			sys_io_cancel
 246	common	fadvise64			sys_fadvise64
-247	common	exit_group			sys_exit_group
+247	common	exit_group			__sys_exit_group
 248	common	lookup_dcookie			sys_lookup_dcookie
 249	common	epoll_create			sys_epoll_create
 250	common	epoll_ctl			sys_epoll_ctl
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz
@ 2021-06-13 19:59   ` Linus Torvalds
  2021-06-13 20:07     ` Michael Schmitz
  2021-06-14  7:13   ` Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-13 19:59 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman,
	Andreas Schwab

On Sat, Jun 12, 2021 at 4:38 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> do_exit() calls ptrace_stop() which may require access to all saved
> registers. We only save those registers not preserved by C code
> currently.
>
> Provide a special syscall entry for exit and exit_group syscalls
> similar to that used by clone and clone3, which have the same
> requirements.

ACK, this looks correct to me.

It might be a good idea to generate a test-case for this - some
"ptrace child, catch exit of it, show registers" kind of thing - just
to show what the effects of the bug were (and to show it's fixed). But
maybe it's not worth the effort.

                  Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-13 19:59   ` Linus Torvalds
@ 2021-06-13 20:07     ` Michael Schmitz
  2021-06-13 20:26       ` Linus Torvalds
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-13 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman,
	Andreas Schwab

Linus,

On 14/06/21 7:59 am, Linus Torvalds wrote:
> On Sat, Jun 12, 2021 at 4:38 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>> do_exit() calls ptrace_stop() which may require access to all saved
>> registers. We only save those registers not preserved by C code
>> currently.
>>
>> Provide a special syscall entry for exit and exit_group syscalls
>> similar to that used by clone and clone3, which have the same
>> requirements.
> ACK, this looks correct to me.
>
> It might be a good idea to generate a test-case for this - some
> "ptrace child, catch exit of it, show registers" kind of thing - just
> to show what the effects of the bug were (and to show it's fixed). But
> maybe it's not worth the effort.

I'd love that, too. My test rig doesn't allow dumping of registers by 
strace, but someone else may have that capacity.

Cheers,

     Michael

>
>                    Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-13 20:07     ` Michael Schmitz
@ 2021-06-13 20:26       ` Linus Torvalds
  2021-06-13 20:33         ` Linus Torvalds
  2021-06-13 20:47         ` Linus Torvalds
  0 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-13 20:26 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman,
	Andreas Schwab

On Sun, Jun 13, 2021 at 1:07 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> I'd love that, too. My test rig doesn't allow dumping of registers by
> strace, but someone else may have that capacity.

I think doing it manually with gdb should be fairly straightforward.

Something like

        gdb /bin/true

and then in gdb you just do

        b main
        run

and then

        catch syscall group:process
        c

and it should stop at the exit_group or exit system call.

At that point you can just do

        info registers

and see if they match what user space *should* be. They'll probably be
complete garbage without the fix.

I do not have an alpha or m68k machine to test (and not the
energy/inclination to set up some virtual environment in qemu either).
But it should be easy if you already have that environment.

             Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-13 20:26       ` Linus Torvalds
@ 2021-06-13 20:33         ` Linus Torvalds
  2021-06-13 20:47         ` Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-13 20:33 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman,
	Andreas Schwab

On Sun, Jun 13, 2021 at 1:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> and then in gdb you just do
>
>         b main
>         run

Btw, this extra stage is unnecessary, but if I just do that "catch
syscall group:process" before the process has even started, gdb gets
confused at the start.

You could skip this and just do "catch syscall exit_group" and then "run".

I used that "group:process" just to catch both the legacy "exit" and
the new "exit_group", but then it catches fork/execve too, and I think
that's what confuses gdb when it happens as you start the process.

Just to clarify why I did that odd thing.

             Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-13 20:26       ` Linus Torvalds
  2021-06-13 20:33         ` Linus Torvalds
@ 2021-06-13 20:47         ` Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-13 20:47 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman,
	Andreas Schwab

On Sun, Jun 13, 2021 at 1:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> They'll probably be complete garbage without the fix.

Actually, never mind. My trivial gdb script is garbage.

I think that even without the fix, it will be fine. Because this test
will just use the regular system call entry tracing point - which gets
the thing right.

It's only PTRACE_EVENT_EXIT reporting that gets it wrong, not the
generic system call tracing case.

I'm not sure if/how you can get gdb to catch that PTRACE_EVENT_EXIT case.
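
Outside of gdb, the PTRACE_EVENT_EXIT stop can be requested directly
with PTRACE_O_TRACEEXIT.  The following is a minimal sketch of such a
tracer, not a finished test case: the function name regs_at_exit_event
is made up, and it assumes the architecture supports PTRACE_GETREGSET
with NT_PRSTATUS (otherwise PTRACE_GETREGS or PTRACE_PEEKUSER would be
needed).  It only fetches the raw register bytes at the exit stop;
comparing them against known values loaded before the exit is left out.

```c
#include <assert.h>
#include <elf.h>		/* NT_PRSTATUS */
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>		/* struct iovec */
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that calls _exit(code), stop it at PTRACE_EVENT_EXIT and
 * read its general-purpose registers into buf.  Returns the number of
 * register bytes read, or -1 on any failure.
 */
static long regs_at_exit_event(int code, void *buf, size_t buflen)
{
	struct iovec iov = { .iov_base = buf, .iov_len = buflen };
	int status;
	long ret = -1;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: ask to be traced, then stop so the parent can set options. */
		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
			_exit(127);
		raise(SIGSTOP);
		_exit(code);
	}

	/* Parent: wait for the initial SIGSTOP. */
	if (waitpid(pid, &status, 0) != pid)
		goto out;
	if (!WIFSTOPPED(status))
		return -1;	/* child already exited and was reaped */
	if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
		   (void *)(long)PTRACE_O_TRACEEXIT) == -1)
		goto out;
	if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
		goto out;

	/* The next stop should be the exit event, reported from do_exit(). */
	if (waitpid(pid, &status, 0) != pid)
		goto out;
	if (!WIFSTOPPED(status))
		return -1;
	if ((status >> 8) != (SIGTRAP | (PTRACE_EVENT_EXIT << 8)))
		goto out;

	/* This is the register read that the thread is worried about. */
	if (ptrace(PTRACE_GETREGSET, pid, (void *)(long)NT_PRSTATUS, &iov) == 0)
		ret = (long)iov.iov_len;
out:
	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return ret;
}
```

A caller would pass a buffer large enough for the register set, for
example an unsigned char array of a few kilobytes, and then decode the
bytes according to the architecture's user register layout.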

Sorry for my inane noise.

                 Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-11 23:26     ` Linus Torvalds
@ 2021-06-13 21:54       ` Eric W. Biederman
  2021-06-13 22:18         ` Linus Torvalds
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-13 21:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz,
	Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jun 11, 2021 at 2:40 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Looking at copy_thread it looks like at least on alpha we are dealing
>> with a structure that defines all of the registers in copy_thread.
>
> On the target side, yes.
>
> On the _source_ side, the code does
>
>         struct pt_regs *regs = current_pt_regs();
>
> and that's the part that means that fork() and related functions need
> to have done that DO_SWITCH_STACK(), so that they have the full
> register set to be copied.
>
> Otherwise it would copy random contents from the source stack.
>
> But that
>
>         if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>
> ends up protecting us, and the code never uses that set of source
> registers for the io worker threads.

The test in copy_thread.  That isn't the case I am worried about.

> So io_uring looks fine on alpha. I didn't check m68k and friends, but
> I think they have the same thing going.

As I have read through the code more I don't think so.

The code paths I am worried about are:

	ret_from_kernel_thread
        	io_wqe_worker
                	get_signal
                        	do_coredump
                        	ptrace_stop

	ret_from_kernel_thread
        	io_sq_thread
                	get_signal
                        	do_coredump
                        	ptrace_stop


As I understand the code the new thread created by create_thread
initially has a full complement of registers, and then is started
by alpha_switch_to:

	.align	4
	.globl	alpha_switch_to
	.type	alpha_switch_to, @function
	.cfi_startproc
alpha_switch_to:
	DO_SWITCH_STACK
	call_pal PAL_swpctx
	lda	$8, 0x3fff
	UNDO_SWITCH_STACK
	bic	$sp, $8, $8
	mov	$17, $0
	ret
	.cfi_endproc
	.size	alpha_switch_to, .-alpha_switch_to


The alpha_switch_to will remove the extra registers from the stack and
then call ret which if I understand alpha assembly correctly is
equivalent to jumping to where $26 points.  Which is
ret_from_kernel_thread (as setup by copy_thread).

Which leaves ret_from_kernel_thread and everything it calls without
the extra context saved on the stack.

I am still trying to understand how we get registers populated at a
fixed offset on the stack during schedule.  As it looks like switch_to
assumes the stack pointer is in the proper location.

>> It looks like we just need something like this to cover the userspace
>> side of exit.
>
> Looks correct to me. Except I think you could just use "fork_like()"
> instead of creating a new (and identical) "exit_like()" macro.
>
>> > But I really wish we had some way to test and trigger this so that we
>> > wouldn't get caught on this before. Something in task_pt_regs() that
>> > catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
>> > affected architectures?
>>
>> I think that would require pushing an extra magic value in SWITCH_STACK
>> and not just popping it but deliberately changing that value in
>> UNDO_SWITCH_STACK.  Basically stack canaries.
>>
>> I don't see how we could do it in an arch independent way though.
>
> No, I think you're right. There's no obvious generic solution to it,
> and once we look at arch-specific ones we're back to "just alpha,
> m68k and nios need this or care" and once you're there you might as
> well just fix it.
>
> ia64 has some "fast system call" model with limited registers too, but
> I think that's limited to just a few very special system calls (ie it
> does the reverse of what alpha does: alpha does the fast case by
> default, and then marks fork/vfork/clone as special).

I wonder if the arch specific solution should be to move the registers
to a fixed location in task_struct (perhaps thread_struct ) so that the
same patterns can apply across all architectures and we don't get
surprises at all.

What appears to be unique about alpha, m68k, and nios is that
space is not always reserved for all of the registers, so we can't
always count on them being saved after a task switch.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-13 21:54       ` Eric W. Biederman
@ 2021-06-13 22:18         ` Linus Torvalds
  2021-06-14  2:05           ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-13 22:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz,
	Kees Cook

[-- Attachment #1: Type: text/plain, Size: 1127 bytes --]

On Sun, Jun 13, 2021 at 2:55 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The alpha_switch_to will remove the extra registers from the stack and
> then call ret which if I understand alpha assembly correctly is
> equivalent to jumping to where $26 points.  Which is
> ret_from_kernel_thread (as setup by copy_thread).
>
> Which leaves ret_from_kernel_thread and everything it calls without
> the extra context saved on the stack.

Uhhuh. Right you are, I think. It's been ages since I worked on that
code and my alpha handbook is somewhere else, but yes, when
alpha_switch_to() has context-switched to the new PCB state, it will
then pop those registers in the new context and return.

So we do set up the right stack frame for the worker thread, but as
you point out, it then gets used up immediately when running. So by
the time the IO worker thread calls get_signal(), it's no longer
useful.
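
The effect can be modelled in user space with nothing more than stack
offsets.  This is a sketch under loud assumptions: the frame sizes are
arbitrary, and kstack_model and the model_* helpers are hypothetical
names, not anything in arch/alpha.  It shows why one switch_stack frame
is gone after the first switch, and why laying out two frames (as in
the untested patch attached to this message) leaves one in place below
pt_regs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
	STACK_WORDS   = 128,	/* size of the modelled kernel stack */
	PT_REGS_WORDS = 4,	/* arbitrary stand-in for sizeof(struct pt_regs) */
	SWITCH_WORDS  = 8,	/* arbitrary stand-in for sizeof(struct switch_stack) */
};

struct kstack_model {
	uint64_t mem[STACK_WORDS];	/* stack grows down from mem[STACK_WORDS] */
	size_t ksp;			/* index of the lowest live word */
};

/* Models copy_thread(): pt_regs on top, then nframes switch_stack frames. */
static void model_copy_thread(struct kstack_model *ks, size_t nframes)
{
	ks->ksp = STACK_WORDS - PT_REGS_WORDS - nframes * SWITCH_WORDS;
}

/* Models the UNDO_SWITCH_STACK in alpha_switch_to: pop one frame. */
static void model_first_switch_to(struct kstack_model *ks)
{
	ks->ksp += SWITCH_WORDS;
}

/* Where a ptrace-style reader expects the frame: directly below pt_regs. */
static size_t model_user_switch_stack(void)
{
	return STACK_WORDS - PT_REGS_WORDS - SWITCH_WORDS;
}

/*
 * The frame can only be trusted while it lies at or above the stack
 * pointer; anything below is free space that later calls overwrite.
 */
static bool model_frame_is_live(const struct kstack_model *ks)
{
	return model_user_switch_stack() >= ks->ksp;
}
```

With one frame the check fails as soon as model_first_switch_to() has
run; with two frames it still holds.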

How very annoying.

The (obviously UNTESTED) patch might be something like the attached.

I wouldn't be surprised if m68k has the exact same thing for the exact
same reason, but I didn't check..

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 982 bytes --]

 arch/alpha/kernel/process.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..edbfe03f4b2c 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -251,8 +251,17 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
+		/*
+		 * Give it *two* switch stacks, one for the kernel
+		 * state return that is used up by alpha_switch_to,
+		 * and one for the "user state" which is accessed
+		 * by ptrace.
+		 */
+		childstack--;
+		childti->pcb.ksp = (unsigned long) childstack;
+
 		memset(childstack, 0,
-			sizeof(struct switch_stack) + sizeof(struct pt_regs));
+			2*sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		childstack->r26 = (unsigned long) ret_from_kernel_thread;
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-13 22:18         ` Linus Torvalds
@ 2021-06-14  2:05           ` Michael Schmitz
  2021-06-14  5:03             ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-14  2:05 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz,
	Kees Cook

Hi Linus,

On 14/06/21 10:18 am, Linus Torvalds wrote:
> On Sun, Jun 13, 2021 at 2:55 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> The alpha_switch_to will remove the extra registers from the stack and
>> then call ret which if I understand alpha assembly correctly is
>> equivalent to jumping to where $26 points.  Which is
>> ret_from_kernel_thread (as setup by copy_thread).
>>
>> Which leaves ret_from_kernel_thread and everything it calls without
>> the extra context saved on the stack.
> Uhhuh. Right you are, I think. It's been ages since I worked on that
> code and my alpha handbook is somewhere else, but yes, when
> alpha_switch_to() has context-switched to the new PCB state, it will
> then pop those registers in the new context and return.
>
> So we do set up the right stack frame for the worker thread, but as
> you point out, it then gets used up immediately when running. So by
> the time the IO worker thread calls get_signal(), it's no longer
> useful.
>
> How very annoying.
>
> The (obviously UNTESTED) patch might be something like the attached.
>
> I wouldn't be surprised if m68k has the exact same thing for the exact
> same reason, but I didn't check..

m68k is indeed similar, it has:

        if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
                 /* kernel thread */
                 memset(frame, 0, sizeof(struct fork_frame));
                 frame->regs.sr = PS_S;
                 frame->sw.a3 = usp; /* function */
                 frame->sw.d7 = arg;
                 frame->sw.retpc = (unsigned long)ret_from_kernel_thread;
                 p->thread.usp = 0;
                 return 0;
         }

so a similar patch should be possible.

Cheers,

     Michael



>
>                    Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-14  2:05           ` Michael Schmitz
@ 2021-06-14  5:03             ` Michael Schmitz
  2021-06-14 16:26               ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-14  5:03 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On second thought, I'm not certain what adding another empty stack frame 
would achieve here.

On m68k, 'frame' already is a new stack frame, for running the new 
thread in. This new frame does not have any user context at all, and 
it's explicitly wiped anyway.

Unless we save all user context on the stack, then push that context to 
a new save frame, and somehow point get_signal to look there for IO 
threads (essentially what Eric suggested), I don't see how this could work?

I must be missing something.

Cheers,

	Michael Schmitz

On 14.06.2021 at 14:05, Michael Schmitz wrote:
>>
>> I wouldn't be surprised if m68k has the exact same thing for the exact
>> same reason, but I didn't check..
>
> m68k is indeed similar, it has:
>
>        if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>                 /* kernel thread */
>                 memset(frame, 0, sizeof(struct fork_frame));
>                 frame->regs.sr = PS_S;
>                 frame->sw.a3 = usp; /* function */
>                 frame->sw.d7 = arg;
>                 frame->sw.retpc = (unsigned long)ret_from_kernel_thread;
>                 p->thread.usp = 0;
>                 return 0;
>         }
>
> so a similar patch should be possible.
>
> Cheers,
>
>     Michael
>
>
>
>>
>>                    Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz
  2021-06-13 19:59   ` Linus Torvalds
@ 2021-06-14  7:13   ` Michael Schmitz
  2021-06-14  7:40     ` Andreas Schwab
  1 sibling, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-14  7:13 UTC (permalink / raw)
  To: geert, linux-m68k; +Cc: schwab, Kars de Jong

Hi Geert,

do we need to add

.globl __sys_exit, __sys_exit_group

(and perhaps __sys_clone3) at the start of entry.S? We have that for 
__sys_fork, __sys_clone and __sys_vfork.

Cheers,

	Michael


On 13.06.2021 at 11:38, Michael Schmitz wrote:
> do_exit() calls ptrace_stop() which may require access to all saved
> registers. We only save those registers not preserved by C code
> currently.
>
> Provide a special syscall entry for exit and exit_group syscalls
> similar to that used by clone and clone3, which have the same
> requirements.
>
> No fix to io_uring appears to be needed, because m68k copy_thread
> treats kernel threads the same as e.g. alpha does, and copies only
> a subset of registers in that case.
>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Andreas Schwab <schwab@linux-m68k.org>
> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
> ---
>  arch/m68k/kernel/entry.S              | 14 ++++++++++++++
>  arch/m68k/kernel/process.c            | 16 ++++++++++++++++
>  arch/m68k/kernel/syscalls/syscall.tbl |  4 ++--
>  3 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/arch/m68k/kernel/entry.S b/arch/m68k/kernel/entry.S
> index 9dd76fb..1e067e6 100644
> --- a/arch/m68k/kernel/entry.S
> +++ b/arch/m68k/kernel/entry.S
> @@ -76,6 +76,20 @@ ENTRY(__sys_clone3)
>  	lea	%sp@(28),%sp
>  	rts
>
> +ENTRY(__sys_exit)
> +	SAVE_SWITCH_STACK
> +	pea	%sp@(SWITCH_STACK_SIZE)
> +	jbsr	m68k_exit
> +	lea	%sp@(28),%sp
> +	rts
> +
> +ENTRY(__sys_exit_group)
> +	SAVE_SWITCH_STACK
> +	pea	%sp@(SWITCH_STACK_SIZE)
> +	jbsr	m68k_exit_group
> +	lea	%sp@(28),%sp
> +	rts
> +
>  ENTRY(sys_sigreturn)
>  	SAVE_SWITCH_STACK
>  	movel	%sp,%sp@-		  | switch_stack pointer
> diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
> index da83cc8..df4e5f1 100644
> --- a/arch/m68k/kernel/process.c
> +++ b/arch/m68k/kernel/process.c
> @@ -138,6 +138,22 @@ asmlinkage int m68k_clone3(struct pt_regs *regs)
>  	return sys_clone3((struct clone_args __user *)regs->d1, regs->d2);
>  }
>
> +/*
> + * Because extra registers are saved on the stack after the sys_exit()
> + * arguments, this C wrapper extracts them from pt_regs * and then calls the
> + * generic sys_exit() implementation.
> + */
> +asmlinkage int m68k_exit(struct pt_regs *regs)
> +{
> +	return sys_exit(regs->d1);
> +}
> +
> +/* Same for sys_exit_group ... */
> +asmlinkage int m68k_exit_group(struct pt_regs *regs)
> +{
> +	return sys_exit_group(regs->d1);
> +}
> +
>  int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
>  		struct task_struct *p, unsigned long tls)
>  {
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 0dd019d..3d5b6fbc 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -8,7 +8,7 @@
>  # The <abi> is always "common" for this file
>  #
>  0	common	restart_syscall			sys_restart_syscall
> -1	common	exit				sys_exit
> +1	common	exit				__sys_exit
>  2	common	fork				__sys_fork
>  3	common	read				sys_read
>  4	common	write				sys_write
> @@ -254,7 +254,7 @@
>  244	common	io_submit			sys_io_submit
>  245	common	io_cancel			sys_io_cancel
>  246	common	fadvise64			sys_fadvise64
> -247	common	exit_group			sys_exit_group
> +247	common	exit_group			__sys_exit_group
>  248	common	lookup_dcookie			sys_lookup_dcookie
>  249	common	epoll_create			sys_epoll_create
>  250	common	epoll_ctl			sys_epoll_ctl
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-14  7:13   ` Michael Schmitz
@ 2021-06-14  7:40     ` Andreas Schwab
  2021-06-14  8:19       ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Andreas Schwab @ 2021-06-14  7:40 UTC (permalink / raw)
  To: Michael Schmitz; +Cc: geert, linux-m68k, Kars de Jong

On Jun 14 2021, Michael Schmitz wrote:

> do we need to add
>
> .globl __sys_exit, __sys_exit_group
>
> (and perhaps __sys_clone3) at the start of entry.S?

ENTRY takes care of that.  You wouldn't be able to link without that
anyway.

> We have that for __sys_fork, __sys_clone and __sys_vfork.

Not really needed.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry
  2021-06-14  7:40     ` Andreas Schwab
@ 2021-06-14  8:19       ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-14  8:19 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: geert, linux-m68k, Kars de Jong

Hi Andreas,

On 14.06.2021 at 19:40, Andreas Schwab wrote:
> On Jun 14 2021, Michael Schmitz wrote:
>
>> do we need to add
>>
>> .globl __sys_exit, __sys_exit_group
>>
>> (and perhaps __sys_clone3) at the start of entry.S?
>
> ENTRY takes care of that.  You wouldn't be able to link without that
> anyway.

Thanks - I guessed being able to link (and run) the code should be good 
enough, but I don't try a lot of tool chains...

Cheers,

	Michael

>
>> We have that for __sys_fork, __sys_clone and __sys_vfork.
>
> Not really needed.
>
> Andreas.
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-14  5:03             ` Michael Schmitz
@ 2021-06-14 16:26               ` Eric W. Biederman
  2021-06-14 22:26                 ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-14 16:26 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Michael Schmitz <schmitzmic@gmail.com> writes:

> On second thought, I'm not certain what adding another empty stack frame would
> achieve here.
>
> On m68k, 'frame' already is a new stack frame, for running the new thread
> in. This new frame does not have any user context at all, and it's explicitly
> wiped anyway.
>
> Unless we save all user context on the stack, then push that context to a new
> save frame, and somehow point get_signal to look there for IO threads
> (essentially what Eric suggested), I don't see how this could work?
>
> I must be missing something.

It is only designed to work well enough so that ptrace will access
something well defined when ptrace accesses io_uring tasks.

The io_uring tasks are special in that they are user process
threads that never run in userspace.  So as long as everything
ptrace can read is accessible on that process all is well.

Having stared a bit longer at the code I think the short term
fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
them both with CONFIG_HAVE_ARCH_TRACEHOOK.

Today CONFIG_HAVE_ARCH_TRACEHOOK guards access to /proc/self/syscall.
Which out of necessity ensures that user context is always readable.
Which seems to solve both the PTRACE_EVENT_EXIT and the io_uring
problems.

What I especially like about that is there are a lot of other reasons
to encourage architectures in a CONFIG_HAVE_ARCH_TRACEHOOK direction.
I think the biggies are getting architectures to store the extra
saved state on context switch into some place in task_struct
and to implement the regset view of registers.

Hmm. This is odd. CONFIG_HAVE_ARCH_TRACEHOOK is supposed to imply
CORE_DUMP_USE_REGSET.  But alpha, csky, h8300, m68k, microblaze, and nds32
don't implement CORE_DUMP_USE_REGSET, yet nds32 implements
CONFIG_HAVE_ARCH_TRACEHOOK.

I will keep digging and see what clean code I can come up with.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-14 16:26               ` Eric W. Biederman
@ 2021-06-14 22:26                 ` Michael Schmitz
  2021-06-15 19:30                   ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-14 22:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Eric,

On 15/06/21 4:26 am, Eric W. Biederman wrote:
> Michael Schmitz <schmitzmic@gmail.com> writes:
>
>> On second thought, I'm not certain what adding another empty stack frame would
>> achieve here.
>>
>> On m68k, 'frame' already is a new stack frame, for running the new thread
>> in. This new frame does not have any user context at all, and it's explicitly
>> wiped anyway.
>>
>> Unless we save all user context on the stack, then push that context to a new
>> save frame, and somehow point get_signal to look there for IO threads
>> (essentially what Eric suggested), I don't see how this could work?
>>
>> I must be missing something.
> It is only designed to work well enough so that ptrace will access
> something well defined when ptrace accesses io_uring tasks.
>
> The io_uring tasks are special in that they are user process
> threads that never run in userspace.  So as long as everything
> ptrace can read is accessible on that process all is well.
OK, I'm testing a patch that would save extra context in 
sys_io_uring_setup, which ought to ensure that for m68k.
> Having stared a bit longer at the code I think the short term
> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
> them both with CONFIG_HAVE_ARCH_TRACEHOOK.

Fair enough :-)

Cheers,

     Michael

>
> Today CONFIG_HAVE_ARCH_TRACEHOOK guards access to /proc/self/syscall.
> Which out of necessity ensures that user context is always readable.
> Which seems to solve both the PTRACE_EVENT_EXIT and the io_uring
> problems.
>
> What I especially like about that is there are a lot of other reasons
> to encourage architectures in a CONFIG_HAVE_ARCH_TRACEHOOK direction.
> I think the biggies are getting architectures to store the extra
> saved state on context switch into some place in task_struct
> and to implement the regset view of registers.
>
> Hmm. This is odd. CONFIG_HAVE_ARCH_TRACEHOOK is supposed to imply
> CORE_DUMP_USE_REGSET.  But alpha, csky, h8300, m68k, microblaze, and nds32
> don't implement CORE_DUMP_USE_REGSET, yet nds32 implements
> CONFIG_HAVE_ARCH_TRACEHOOK.
>
> I will keep digging and see what clean code I can come up with.
>
> Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-14 22:26                 ` Michael Schmitz
@ 2021-06-15 19:30                   ` Eric W. Biederman
  2021-06-15 19:36                     ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman
                                       ` (3 more replies)
  0 siblings, 4 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-15 19:30 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Michael Schmitz <schmitzmic@gmail.com> writes:

> Hi Eric,
>
> On 15/06/21 4:26 am, Eric W. Biederman wrote:
>> Michael Schmitz <schmitzmic@gmail.com> writes:
>>
>>> On second thought, I'm not certain what adding another empty stack frame would
>>> achieve here.
>>>
>>> On m68k, 'frame' already is a new stack frame, for running the new thread
>>> in. This new frame does not have any user context at all, and it's explicitly
>>> wiped anyway.
>>>
>>> Unless we save all user context on the stack, then push that context to a new
>>> save frame, and somehow point get_signal to look there for IO threads
>>> (essentially what Eric suggested), I don't see how this could work?
>>>
>>> I must be missing something.
>> It is only designed to work well enough so that ptrace will access
>> something well defined when ptrace accesses io_uring tasks.
>>
>> The io_uring tasks are special in that they are user process
>> threads that never run in userspace.  So as long as everything
>> ptrace can read is accessible on that process all is well.
> OK, I'm testing a patch that would save extra context in sys_io_uring_setup,
> which ought to ensure that for m68k.

I had to update ret_from_kernel_thread to pop that state to get Linus's
change to boot.  Apparently kernel_threads exiting needs to be handled.

>> Having stared a bit longer at the code I think the short term
>> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
>> them both with CONFIG_HAVE_ARCH_TRACEHOOK.

Which does not work because nios2 which looks susceptible
sets CONFIG_HAVE_ARCH_TRACEHOOK.

A further look shows that there is also PTRACE_EVENT_EXEC that
needs to be handled so execve and execveat need to be wrapped
as well.

Do you happen to know if there is userspace that will run
in qemu-system-m68k that can be used for testing?

Eric


^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads
  2021-06-15 19:30                   ` Eric W. Biederman
@ 2021-06-15 19:36                     ` Eric W. Biederman
  2021-06-15 22:02                       ` Linus Torvalds
  2021-06-16 20:50                       ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Al Viro
  2021-06-15 20:56                     ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz
                                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-15 19:36 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


While thinking about the information leaks fixed in 77f6ab8b7768
("don't dump the threads that had been already exiting when zapped.")
I realized the problem is much more general than just coredumps and
exit_mm.  We have io_uring threads, PTRACE_EVENT_EXEC and
PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
registers, but on some architectures has not saved them.

The function alpha_switch_to does something reasonable: it saves the
floating point registers and the caller saved registers and switches
to a different thread.  Any register the caller is not expected to
save it does not save.

Meanwhile the system call entry point on alpha also does something
reasonable.  The system call entry point saves all but the caller
saved integer registers and doesn't touch the floating point registers
as the kernel code does not touch them.

This is a nice happy fast path until the kernel wants to access the
user space's registers through ptrace or similar.  As user space's
caller saved registers may be saved at an unpredictable point in the
kernel code's stack, the routine which may stop and make the userspace
registers available must be wrapped by code that will first save a
switch stack frame at the bottom of the call stack, call the code that
may access those registers and then pop the switch stack frame.

The practical problem with this code structure is that this results in
a game of whack-a-mole wrapping different kernel system calls.  Losing
the game of whack-a-mole results in a security hole where userspace can
write arbitrary data to the kernel stack.

I looked and there is nothing I can do that is not arch specific, so
whack the moles with a minimal backportable fix.

This change survives boot testing on qemu-system-alpha.

Cc: stable@vger.kernel.org
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/alpha/kernel/entry.S              | 21 +++++++++++++++++++++
 arch/alpha/kernel/process.c            | 11 ++++++++++-
 arch/alpha/kernel/syscalls/syscall.tbl |  8 ++++----
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..98bb5b805089 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -785,6 +785,7 @@ ret_from_kernel_thread:
 	mov	$9, $27
 	mov	$10, $16
 	jsr	$26, ($9)
+	lda	$sp, SWITCH_STACK_SIZE($sp)
 	br	$31, ret_to_user
 .end ret_from_kernel_thread
 
@@ -811,6 +812,26 @@ alpha_\name:
 fork_like fork
 fork_like vfork
 fork_like clone
+fork_like exit
+fork_like exit_group
+
+.macro	exec_like name
+	.align	4
+	.globl	alpha_\name
+	.ent	alpha_\name
+	.cfi_startproc
+alpha_\name:
+	.prologue 0
+	DO_SWITCH_STACK
+	jsr	$26, sys_\name
+	UNDO_SWITCH_STACK
+	ret
+	.cfi_endproc
+.end	alpha_\name
+.endm
+
+exec_like execve
+exec_like execveat
 
 .macro	sigreturn_like name
 	.align	4
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..edbfe03f4b2c 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -251,8 +251,17 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
+		/*
+		 * Give it *two* switch stacks, one for the kernel
+		 * state return that is used up by alpha_switch_to,
+		 * and one for the "user state" which is accessed
+		 * by ptrace.
+		 */
+		childstack--;
+		childti->pcb.ksp = (unsigned long) childstack;
+
 		memset(childstack, 0,
-			sizeof(struct switch_stack) + sizeof(struct pt_regs));
+			2*sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		childstack->r26 = (unsigned long) ret_from_kernel_thread;
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..5f85f3c11ed4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	osf_syscall			alpha_syscall_zero
-1	common	exit				sys_exit
+1	common	exit				alpha_exit
 2	common	fork				alpha_fork
 3	common	read				sys_read
 4	common	write				sys_write
@@ -65,7 +65,7 @@
 56	common	osf_revoke			sys_ni_syscall
 57	common	symlink				sys_symlink
 58	common	readlink			sys_readlink
-59	common	execve				sys_execve
+59	common	execve				alpha_execve
 60	common	umask				sys_umask
 61	common	chroot				sys_chroot
 62	common	osf_old_fstat			sys_ni_syscall
@@ -333,7 +333,7 @@
 400	common	io_getevents			sys_io_getevents
 401	common	io_submit			sys_io_submit
 402	common	io_cancel			sys_io_cancel
-405	common	exit_group			sys_exit_group
+405	common	exit_group			alpha_exit_group
 406	common	lookup_dcookie			sys_lookup_dcookie
 407	common	epoll_create			sys_epoll_create
 408	common	epoll_ctl			sys_epoll_ctl
@@ -441,7 +441,7 @@
 510	common	renameat2			sys_renameat2
 511	common	getrandom			sys_getrandom
 512	common	memfd_create			sys_memfd_create
-513	common	execveat			sys_execveat
+513	common	execveat			alpha_execveat
 514	common	seccomp				sys_seccomp
 515	common	bpf				sys_bpf
 516	common	userfaultfd			sys_userfaultfd
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-15 19:30                   ` Eric W. Biederman
  2021-06-15 19:36                     ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman
@ 2021-06-15 20:56                     ` Michael Schmitz
  2021-06-16  0:23                       ` Finn Thain
  2021-06-15 21:58                     ` Linus Torvalds
  2021-06-16  7:38                     ` Geert Uytterhoeven
  3 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-15 20:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook,
	John Paul Adrian Glaubitz

Hi Eric,

On 16/06/21 7:30 am, Eric W. Biederman wrote:
>
>>> The io_uring tasks are special in that they are user process
>>> threads that never run in userspace.  So as long as everything
>>> ptrace can read is accessible on that process all is well.
>> OK, I'm testing a patch that would save extra context in sys_io_uring_setup,
>> which ought to ensure that for m68k.
> I had to update ret_from_kernel_thread to pop that state to get Linus's
> change to boot.  Apparently kernel_threads exiting needs to be handled.

Hadn't yet got to that stage, sorry. Still stress testing stage 1 of my 
fix (push complete context). I would have thought that this should be 
sufficient (gives us a complete stack frame for ptrace code to work on)?

But it makes sense that when you push an extra stack frame, you'd need 
to pop that on exit.

>
>>> Having stared a bit longer at the code I think the short term
>>> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
>>> them both with CONFIG_HAVE_ARCH_TRACEHOOK.
> Which does not work because nios2 which looks susceptible
> sets CONFIG_HAVE_ARCH_TRACEHOOK.
>
> A further look shows that there is also PTRACE_EVENT_EXEC that
> needs to be handled so execve and execveat need to be wrapped
> as well.
>
> Do you happen to know if there is userspace that will run
> in qemu-system-m68k that can be used for testing?

I surmise so. I don't use qemu myself - either ARAnyM, or actual 
hardware. Hardware is limited to 14 MB RAM, which has prevented me from 
using more than simple regression testing. In particular, I can't test 
sys_io_uring_setup there.

Adrian uses qemu a lot, and has supplied disk images to work from on 
occasion. Maybe he's got something recent enough to support 
sys_io_uring_setup ... I've CC:ed him in, as I'd love to do some more 
testing as well.

Cheers,

     Michael

>
> Eric
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-15 19:30                   ` Eric W. Biederman
  2021-06-15 19:36                     ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman
  2021-06-15 20:56                     ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz
@ 2021-06-15 21:58                     ` Linus Torvalds
  2021-06-16 15:06                       ` Eric W. Biederman
  2021-06-21 13:54                       ` Al Viro
  2021-06-16  7:38                     ` Geert Uytterhoeven
  3 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-15 21:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Tue, Jun 15, 2021 at 12:32 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I had to update ret_from_kernel_thread to pop that state to get Linus's
> change to boot.  Apparently kernel_threads exiting needs to be handled.

You are very right.

That, btw, seems to be a horrible design mistake, but I think it's how
"kernel_execve()" works - both for the initial "init", but also for
user-mode helper processes.

Both of those cases do "kernel_thread()" to create a new thread, and
then that new kernel thread does kernel_execve() to create the user
space image for that thread. And that act of "execve()" clears
PF_KTHREAD from the thread, and then the final return from the kernel
thread function returns to that new user space.

Or something like that. It's been ages since I looked at that code,
and your patch initially confused the heck out of me because I went
"that can't _possibly_ be needed".

But yes, I think your patch is right.

And I think our horrible "kernel threads return to user space when
done" is absolutely horrifically nasty. Maybe of the clever sort, but
mostly of the historical horror sort.

Or am I mis-remembering how this ends up working? Did you look at
exactly what it was that returned from kernel threads?

This might be worth commenting on somewhere. But your patch for alpha
looks correct to me. Did you have some test-case to verify ptrace() on
io worker threads?

             Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads
  2021-06-15 19:36                     ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman
@ 2021-06-15 22:02                       ` Linus Torvalds
  2021-06-16 16:32                         ` Eric W. Biederman
  2021-06-16 20:50                       ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-15 22:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Tue, Jun 15, 2021 at 12:36 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I looked and there is nothing I can do that is not arch specific, so
> whack the moles with a minimal backportable fix.
>
> This change survives boot testing on qemu-system-alpha.

So as mentioned in the other thread, I think this patch is exactly right.

However, the need for this part

> @@ -785,6 +785,7 @@ ret_from_kernel_thread:
>         mov     $9, $27
>         mov     $10, $16
>         jsr     $26, ($9)
> +       lda     $sp, SWITCH_STACK_SIZE($sp)
>         br      $31, ret_to_user
>  .end ret_from_kernel_thread

obviously eluded me in my "how about something like this", and I had
to really try to figure out why we'd ever return.

Which is why I came to that "oooh - kernel_execve()" realization.

It might be good to comment on that somewhere. And if you can think of
some other case, that should be mentioned too.

Anyway, thanks for looking into this odd case. And if you have a
test-case for this all, it really would be a good thing. Yes, it
should only affect a couple of odd-ball architectures, but still... It
would also be good to hear that you actually did verify the behavior
of this patch wrt that ptrace-of-io-worker-threads case..

           Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-15 20:56                     ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz
@ 2021-06-16  0:23                       ` Finn Thain
  0 siblings, 0 replies; 119+ messages in thread
From: Finn Thain @ 2021-06-16  0:23 UTC (permalink / raw)
  To: Eric W. Biederman, Michael Schmitz
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook,
	John Paul Adrian Glaubitz

On Wed, 16 Jun 2021, Michael Schmitz wrote:

> > 
> > Do you happen to know if there is userspace that will run in 
> > qemu-system-m68k that can be used for testing?
> 
> I surmise so. I don't use qemu myself - either ARAnyM, or actual 
> hardware. Hardware is limited to 14 MB RAM, which has prevented me from 
> using more than simple regression testing. In particular, I can't test 
> sys_io_uring_setup there.
> 
> Adrian uses qemu a lot, and has supplied disk images to work from on 
> occasion. Maybe he's got something recent enough to support 
> sys_io_uring_setup ... I've CC:ed him in, as I'd love to do some more 
> testing as well.
> 

As well as Debian/m68k, there is also a Gentoo/m68k stage3 rootfs 
available here: 
https://sourceforge.net/projects/linux-mac68k/files/Gentoo%20m68k%20unauthorized/

I built that rootfs last year using Catalyst. Some background (including 
the qemu-system-m68k command-line) can be found here:
https://forums.gentoo.org/viewtopic-t-1100780.html

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-15 19:30                   ` Eric W. Biederman
                                       ` (2 preceding siblings ...)
  2021-06-15 21:58                     ` Linus Torvalds
@ 2021-06-16  7:38                     ` Geert Uytterhoeven
  2021-06-16 19:40                       ` Michael Schmitz
  3 siblings, 1 reply; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-16  7:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Linus Torvalds, linux-arch, Jens Axboe,
	Oleg Nesterov, Al Viro, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Eric,

On Tue, Jun 15, 2021 at 9:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Do you happen to know if there is userspace that will run
> in qemu-system-m68k that can be used for testing?

There's a link to an image in Laurent's patch series "[PATCH 0/2]
m68k: Add Virtual M68k Machine"
https://lore.kernel.org/linux-m68k/20210323221430.3735147-1-laurent@vivier.eu/

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-15 21:58                     ` Linus Torvalds
@ 2021-06-16 15:06                       ` Eric W. Biederman
  2021-06-21 13:54                       ` Al Viro
  1 sibling, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 15:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 15, 2021 at 12:32 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> I had to update ret_from_kernel_thread to pop that state to get Linus's
>> change to boot.  Apparently kernel_threads exiting needs to be handled.
>
> You are very right.
>
> That, btw, seems to be a horrible design mistake, but I think it's how
> "kernel_execve()" works - both for the initial "init", but also for
> user-mode helper processes.
>
> Both of those cases do "kernel_thread()" to create a new thread, and
> then that new kernel thread does kernel_execve() to create the user
> space image for that thread. And that act of "execve()" clears
> PF_KTHREAD from the thread, and then the final return from the kernel
> thread function returns to that new user space.
>
> Or something like that. It's been ages since I looked at that code,
> and your patch initially confused the heck out of me because I went
> "that can't _possibly_ be needed".
>
> But yes, I think your patch is right.
>
> And I think our horrible "kernel threads return to user space when
> done" is absolutely horrifically nasty. Maybe of the clever sort, but
> mostly of the historical horror sort.
>
> Or am I mis-remembering how this ends up working? Did you look at
> exactly what it was that returned from kernel threads?
>
> This might be worth commenting on somewhere. But your patch for alpha
> looks correct to me. Did you have some test-case to verify ptrace() on
> io worker threads?

At this point I have just booted an alpha image on qemu-system-alpha.

I do have gdb in my kernel image so I can test that.  I haven't yet but
I can and should.

Sleeping on it I came up with a plan to add TF_SWITCH_STACK_SAVED to
indicate if the registers have been saved.  The DO_SWITCH_STACK and
UNDO_SWITCH_STACK helpers (except in alpha_switch_to) can test that.
The ptrace helpers can test that and turn an access of random kernel
stack contents into something well behaved that does WARN_ON_ONCE
because we should not get there.
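
A minimal user-space sketch of that plan, assuming a per-task flag that
records whether the switch_stack registers were saved.  The names
task_sketch, reg_addr, read_reg and write_reg are hypothetical and are
not kernel APIs; in the kernel the "not saved" branch is where the
WARN_ON_ONCE would go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SWITCH_REGS 8	/* hypothetical register count for the sketch */

/* Hypothetical stand-in for a task with a saved-registers flag. */
struct task_sketch {
	bool switch_stack_saved;		/* models TF_SWITCH_STACK_SAVED */
	unsigned long switch_regs[NR_SWITCH_REGS];
};

/* Dummy slot handed out when the registers were never saved. */
static unsigned long scratch_zero;

/* Return the register slot, or the dummy slot if nothing was saved. */
static unsigned long *reg_addr(struct task_sketch *t, size_t regno)
{
	assert(regno < NR_SWITCH_REGS);
	if (!t->switch_stack_saved) {
		scratch_zero = 0;
		return &scratch_zero;
	}
	return &t->switch_regs[regno];
}

/* Reads of unsaved registers yield 0 instead of stale stack contents. */
static unsigned long read_reg(struct task_sketch *t, size_t regno)
{
	return *reg_addr(t, regno);
}

/* Writes to unsaved registers are dropped; returns true if stored. */
static bool write_reg(struct task_sketch *t, size_t regno, unsigned long v)
{
	unsigned long *addr = reg_addr(t, regno);

	if (addr == &scratch_zero)
		return false;
	*addr = v;
	return true;
}
```

With the flag clear, read_reg returns 0 and write_reg leaves the stack
untouched; with the flag set, both behave like a plain array access.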

I suspect adding TF_SWITCH_STACK_SAVED should come first so it
is easy to verify the problem behavior, before I fix it.

My real goal is to find a pattern that architectures whose register
saves are structured like alpha's can emulate, to minimize problems in
the future.

Plus I would really like to get the last handful of architectures
updated so we can remove CONFIG_HAVE_ARCH_TRACEHOOK.  I think we can
do that on alpha because we save all of the system call arguments
in pt_regs and that is all the other non-ptrace code paths care about.

AKA I am trying to move the old architectures forward so we can get rid
of unnecessary complications in the core code.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads
  2021-06-15 22:02                       ` Linus Torvalds
@ 2021-06-16 16:32                         ` Eric W. Biederman
  2021-06-16 18:29                           ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 16:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 15, 2021 at 12:36 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> I looked and there is nothing I can do that is not arch specific, so
>> whack the moles with a minimal backportable fix.
>>
>> This change survives boot testing on qemu-system-alpha.
>
> So as mentioned in the other thread, I think this patch is exactly right.
>
> However, the need for this part
>
>> @@ -785,6 +785,7 @@ ret_from_kernel_thread:
>>         mov     $9, $27
>>         mov     $10, $16
>>         jsr     $26, ($9)
>> +       lda     $sp, SWITCH_STACK_SIZE($sp)
>>         br      $31, ret_to_user
>>  .end ret_from_kernel_thread
>
> obviously eluded me in my "how about something like this", and I had
> to really try to figure out why we'd ever return.
>
> Which is why I came to that "oooh - kernel_execve()" realization.
>
> It might be good to comment on that somewhere. And if you can think of
> some other case, that should be mentioned too.
>
> Anyway, thanks for looking into this odd case. And if you have a
> test-case for this all, it really would be a good thing. Yes, it
> should only affect a couple of odd-ball architectures, but still... It
> would also be good to hear that you actually did verify the behavior
> of this patch wrt that ptrace-of-io-worker-threads case..

*Grumble*

So just going through and looking to see what it takes to instrument
and put in warnings when things go wrong I have found another issue.

Today there exists:
PTRACE_EVENT_FORK
PTRACE_EVENT_VFORK
PTRACE_EVENT_CLONE

Which happens after the actual fork operation in the kernel.

The following code wraps those operations in arch/alpha/kernel/entry.S

.macro	fork_like name
	.align	4
	.globl	alpha_\name
	.ent	alpha_\name
alpha_\name:
	.prologue 0
	bsr	$1, do_switch_stack
	jsr	$26, sys_\name
	ldq	$26, 56($sp)
	lda	$sp, SWITCH_STACK_SIZE($sp)
	ret
.end	alpha_\name
.endm

The code in fork.c calls ptrace_event_pid, which ultimately calls
ptrace_stop.  So userspace can reasonably expect to stop the process
and change its registers.

With unconditionally popping the switch stack any of those registers
that are modified are lost.
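
The effect can be modelled in user space with a single register.  This
is only a sketch: cpu_sketch, frame_sketch and run_wrapper are
hypothetical names, and the one-field frame stands in for the whole
switch_stack.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical one-register model of the CPU and of a switch_stack frame. */
struct cpu_sketch { unsigned long r9; };
struct frame_sketch { unsigned long r9; };

/* Models do_switch_stack: copy the live register into the frame. */
static void push_frame(struct frame_sketch *f, const struct cpu_sketch *cpu)
{
	f->r9 = cpu->r9;
}

/* Models a tracer changing the saved register while the task is stopped. */
static void tracer_poke(struct frame_sketch *f, unsigned long v)
{
	f->r9 = v;
}

/*
 * Models the wrapper around sys_fork and friends.  With reload_on_pop
 * false the frame is simply discarded, as the "lda $sp, ..." pop does;
 * with it true the register is reloaded from the frame first.
 * Returns the register value the task resumes with.
 */
static unsigned long run_wrapper(unsigned long initial, unsigned long poked,
				 bool reload_on_pop)
{
	struct cpu_sketch cpu = { .r9 = initial };
	struct frame_sketch frame;

	push_frame(&frame, &cpu);
	assert(frame.r9 == initial);
	tracer_poke(&frame, poked);	/* happens inside ptrace_stop */
	if (reload_on_pop)
		cpu.r9 = frame.r9;
	return cpu.r9;
}
```

Calling run_wrapper(1, 42, false) resumes with the old value, while
run_wrapper(1, 42, true) resumes with the value the tracer wrote.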

So I will update my changes to handle that case as well.


Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH 0/2] alpha/ptrace: Improved switch_stack handling
  2021-06-16 16:32                         ` Eric W. Biederman
@ 2021-06-16 18:29                           ` Eric W. Biederman
  2021-06-16 18:31                             ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
  2021-06-16 18:32                             ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman
  0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


This pair of changes has not received anything beyond build and boot
testing.  I am posting these changes as they do a much better job of
warning of problems and shutting down the security hole, which makes
them a much better pattern than my last patch.

I hope to get the test cases soon.

 arch/alpha/include/asm/thread_info.h   |  2 ++
 arch/alpha/kernel/entry.S              | 62 ++++++++++++++++++++++++++--------
 arch/alpha/kernel/process.c            |  3 ++
 arch/alpha/kernel/ptrace.c             | 13 +++++--
 arch/alpha/kernel/syscalls/syscall.tbl |  8 ++---
 5 files changed, 67 insertions(+), 21 deletions(-)

Eric W. Biederman (2):
      alpha/ptrace: Record and handle the absence of switch_stack
      alpha/ptrace: Add missing switch_stack frames


^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 18:29                           ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman
@ 2021-06-16 18:31                             ` Eric W. Biederman
  2021-06-16 20:00                               ` Linus Torvalds
                                                 ` (2 more replies)
  2021-06-16 18:32                             ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman
  1 sibling, 3 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


While thinking about the information leaks fixed in 77f6ab8b7768
("don't dump the threads that had been already exiting when zapped.")
I realized the problem is much more general than just coredumps and
exit_mm.  We have io_uring threads, PTRACE_EVENT_FORK,
PTRACE_EVENT_VFORK, PTRACE_EVENT_CLONE, PTRACE_EVENT_EXEC and
PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
registers, but on some architectures has not saved them so
they can be modified.

The function alpha_switch_to does something reasonable: it saves the
floating point registers and the callee saved registers and switches
to a different thread.  Any register the caller is expected to save
it does not save.

Meanwhile the system call entry point on alpha also does something
reasonable.  The system call entry point saves all but the callee
saved integer registers and doesn't touch the floating point registers
as the kernel code does not touch them.

This is a nice happy fast path until the kernel wants to access the
user space's registers through ptrace or similar.  As user space's
callee saved registers may be saved at an unpredictable point in the
kernel code's stack, the routine which may stop and make the userspace
registers available must be wrapped by code that will first save a
switch stack frame at the bottom of the call stack, call the code that
may access those registers and then pop the switch stack frame.

The practical problem with this code structure is that this results in
a game of whack-a-mole wrapping different kernel system calls.  Losing
the game of whack-a-mole results in a security hole where userspace can
write arbitrary data to the kernel stack.

In general it is not possible to prevent generic code from introducing
a ptrace_stop or register access without knowing alpha's limitation,
namely that alpha does not make all of the registers available.

Prevent security holes by recording when all of the registers are
available so generic code changes do not result in security holes
on alpha.

Cc: stable@vger.kernel.org
Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread")
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/alpha/include/asm/thread_info.h |  2 ++
 arch/alpha/kernel/entry.S            | 38 ++++++++++++++++++++++------
 arch/alpha/kernel/ptrace.c           | 13 ++++++++--
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 2592356e3215..41e5986ed9c8 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -63,6 +63,7 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SYSCALL_AUDIT	4	/* syscall audit active */
 #define TIF_NOTIFY_SIGNAL	5	/* signal notifications exist */
+#define TIF_ALLREGS_SAVED	6	/* both pt_regs and switch_stack saved */
 #define TIF_DIE_IF_KERNEL	9	/* dik recursion lock */
 #define TIF_MEMDIE		13	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	14	/* idle is polling for TIF_NEED_RESCHED */
@@ -73,6 +74,7 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
 #define _TIF_NOTIFY_SIGNAL	(1<<TIF_NOTIFY_SIGNAL)
+#define _TIF_ALLREGS_SAVED	(1<<TIF_ALLREGS_SAVED)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 
 /* Work to do on interrupt/exception return.  */
diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..c1edf54dc035 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -174,6 +174,28 @@
 	.cfi_adjust_cfa_offset	-SWITCH_STACK_SIZE
 .endm
 
+.macro	SAVE_SWITCH_STACK
+	DO_SWITCH_STACK
+1:	ldl_l	$1, TI_FLAGS($8)
+	bis	$1, _TIF_ALLREGS_SAVED, $1
+	stl_c	$1, TI_FLAGS($8)
+	beq	$1, 2f
+.subsection 2
+2:	br	1b
+.previous
+.endm
+
+.macro	RESTORE_SWITCH_STACK
+1:	ldl_l	$1, TI_FLAGS($8)
+	bic	$1, _TIF_ALLREGS_SAVED, $1
+	stl_c	$1, TI_FLAGS($8)
+	beq	$1, 2f
+.subsection 2
+2:	br	1b
+.previous
+	UNDO_SWITCH_STACK
+.endm
+
 /*
  * Non-syscall kernel entry points.
  */
@@ -559,9 +581,9 @@ $work_resched:
 
 $work_notifysig:
 	mov	$sp, $16
-	DO_SWITCH_STACK
+	SAVE_SWITCH_STACK
 	jsr	$26, do_work_pending
-	UNDO_SWITCH_STACK
+	RESTORE_SWITCH_STACK
 	br	restore_all
 
 /*
@@ -572,9 +594,9 @@ $work_notifysig:
 	.type	strace, @function
 strace:
 	/* set up signal stack, call syscall_trace */
-	DO_SWITCH_STACK
+	SAVE_SWITCH_STACK
 	jsr	$26, syscall_trace_enter /* returns the syscall number */
-	UNDO_SWITCH_STACK
+	RESTORE_SWITCH_STACK
 
 	/* get the arguments back.. */
 	ldq	$16, SP_OFF+24($sp)
@@ -602,9 +624,9 @@ ret_from_straced:
 $strace_success:
 	stq	$0, 0($sp)		/* save return value */
 
-	DO_SWITCH_STACK
+	SAVE_SWITCH_STACK
 	jsr	$26, syscall_trace_leave
-	UNDO_SWITCH_STACK
+	RESTORE_SWITCH_STACK
 	br	$31, ret_from_sys_call
 
 	.align	3
@@ -618,13 +640,13 @@ $strace_error:
 	stq	$0, 0($sp)
 	stq	$1, 72($sp)	/* a3 for return */
 
-	DO_SWITCH_STACK
+	SAVE_SWITCH_STACK
 	mov	$18, $9		/* save old syscall number */
 	mov	$19, $10	/* save old a3 */
 	jsr	$26, syscall_trace_leave
 	mov	$9, $18
 	mov	$10, $19
-	UNDO_SWITCH_STACK
+	RESTORE_SWITCH_STACK
 
 	mov	$31, $26	/* tell "ret_from_sys_call" we can restart */
 	br	ret_from_sys_call
diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index 8c43212ae38e..41fb994f36dc 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -117,7 +117,13 @@ get_reg_addr(struct task_struct * task, unsigned long regno)
 		zero = 0;
 		addr = &zero;
 	} else {
-		addr = task_stack_page(task) + regoff[regno];
+		int off = regoff[regno];
+		if (WARN_ON_ONCE((off < PT_REG(r0)) &&
+				!test_ti_thread_flag(task_thread_info(task),
+						     TIF_ALLREGS_SAVED)))
+			addr = &zero;
+		else
+			addr = task_stack_page(task) + off;
 	}
 	return addr;
 }
@@ -145,13 +151,16 @@ get_reg(struct task_struct * task, unsigned long regno)
 static int
 put_reg(struct task_struct *task, unsigned long regno, unsigned long data)
 {
+	unsigned long *addr;
 	if (regno == 63) {
 		task_thread_info(task)->ieee_state
 		  = ((task_thread_info(task)->ieee_state & ~IEEE_SW_MASK)
 		     | (data & IEEE_SW_MASK));
 		data = (data & FPCR_DYN_MASK) | ieee_swcr_to_fpcr(data);
 	}
-	*get_reg_addr(task, regno) = data;
+	addr = get_reg_addr(task, regno);
+	if (addr != &zero)
+		*addr = data;
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 18:29                           ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman
  2021-06-16 18:31                             ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
@ 2021-06-16 18:32                             ` Eric W. Biederman
  2021-06-16 20:25                               ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


With the introduction of PTRACE_EVENT_XXX flags during the 2.5
development cycle it became possible for ptrace to write arbitrary
data to alpha kernel stack frames because it was assumed that wherever
ptrace_stop was called both a pt_regs and a switch_stack were saved
upon the stack.

The introduction of TIF_ALLREGS_SAVED removed the assumption that
switch_stack was saved on the kernel thread by transforming the
problem into a lesser bug where the accesses simply don't work.

Saving struct switch_stack has to happen on the lowest level of the
stack on alpha because it contains callee saved registers, which will
be saved by the C code in arbitrary locations on the stack if the data
is not saved immediately.

Update kernel threads to save a full set of userspace registers on
the stack so that io_uring threads can be ptraced.

Update fork, vfork, clone, exit, exit_group, execve, and execveat to
save all of the userspace registers when they are called as there are
known PTRACE_EVENT_XXX ptrace stop points in those functions where
registers can be manipulated.

The switch_stack frames serve double duty in fork, vfork, and clone,
as both the child's inputs to alpha_switch_to, and the parent's
saved copy of the registers for debuggers to modify.  This change
marks the frame as present in the parent, and clears
TIF_ALLREGS_SAVED in the child as alpha_switch_to will consume the
switch_stack when the child is started.

Cc: stable@vger.kernel.org
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread")
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/alpha/kernel/entry.S              | 24 +++++++++++++++++-------
 arch/alpha/kernel/process.c            |  3 +++
 arch/alpha/kernel/syscalls/syscall.tbl |  8 ++++----
 3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index c1edf54dc035..f29a40e2daf1 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -801,13 +801,18 @@ ret_from_fork:
 	.align 4
 	.globl	ret_from_kernel_thread
 	.ent	ret_from_kernel_thread
+	.cfi_startproc
 ret_from_kernel_thread:
 	mov	$17, $16
 	jsr	$26, schedule_tail
+	/* PF_IO_WORKER threads can be ptraced */
+	SAVE_SWITCH_STACK
 	mov	$9, $27
 	mov	$10, $16
 	jsr	$26, ($9)
+	RESTORE_SWITCH_STACK
 	br	$31, ret_to_user
+	.cfi_endproc
 .end ret_from_kernel_thread
 
 \f
@@ -816,23 +821,28 @@ ret_from_kernel_thread:
  * have to play switch_stack games.
  */
 
-.macro	fork_like name
+.macro	allregs name
 	.align	4
 	.globl	alpha_\name
 	.ent	alpha_\name
+	.cfi_startproc
 alpha_\name:
 	.prologue 0
-	bsr	$1, do_switch_stack
+	SAVE_SWITCH_STACK
 	jsr	$26, sys_\name
-	ldq	$26, 56($sp)
-	lda	$sp, SWITCH_STACK_SIZE($sp)
+	RESTORE_SWITCH_STACK
 	ret
+	.cfi_endproc
 .end	alpha_\name
 .endm
 
-fork_like fork
-fork_like vfork
-fork_like clone
+allregs fork
+allregs vfork
+allregs clone
+allregs exit
+allregs exit_group
+allregs execve
+allregs execveat
 
 .macro	sigreturn_like name
 	.align	4
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..3bf480468a89 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,6 +249,9 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
+	/* In the child the registers are consumed by alpha_switch_to */
+	clear_ti_thread_flag(childti, TIF_ALLREGS_SAVED);
+
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 		memset(childstack, 0,
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..5f85f3c11ed4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	osf_syscall			alpha_syscall_zero
-1	common	exit				sys_exit
+1	common	exit				alpha_exit
 2	common	fork				alpha_fork
 3	common	read				sys_read
 4	common	write				sys_write
@@ -65,7 +65,7 @@
 56	common	osf_revoke			sys_ni_syscall
 57	common	symlink				sys_symlink
 58	common	readlink			sys_readlink
-59	common	execve				sys_execve
+59	common	execve				alpha_execve
 60	common	umask				sys_umask
 61	common	chroot				sys_chroot
 62	common	osf_old_fstat			sys_ni_syscall
@@ -333,7 +333,7 @@
 400	common	io_getevents			sys_io_getevents
 401	common	io_submit			sys_io_submit
 402	common	io_cancel			sys_io_cancel
-405	common	exit_group			sys_exit_group
+405	common	exit_group			alpha_exit_group
 406	common	lookup_dcookie			sys_lookup_dcookie
 407	common	epoll_create			sys_epoll_create
 408	common	epoll_ctl			sys_epoll_ctl
@@ -441,7 +441,7 @@
 510	common	renameat2			sys_renameat2
 511	common	getrandom			sys_getrandom
 512	common	memfd_create			sys_memfd_create
-513	common	execveat			sys_execveat
+513	common	execveat			alpha_execveat
 514	common	seccomp				sys_seccomp
 515	common	bpf				sys_bpf
 516	common	userfaultfd			sys_userfaultfd
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-16  7:38                     ` Geert Uytterhoeven
@ 2021-06-16 19:40                       ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-16 19:40 UTC (permalink / raw)
  To: Geert Uytterhoeven, Eric W. Biederman
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, linux-m68k, Arnd Bergmann, Ley Foon Tan,
	Tejun Heo, Kees Cook

Hi Geert,

On 16/06/21 7:38 pm, Geert Uytterhoeven wrote:
> Hi Eric,
>
> On Tue, Jun 15, 2021 at 9:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Do you happen to know if there is userspace that will run
>> in qemu-system-m68k that can be used for testing?
> There's a link to an image in Laurent's patch series "[PATCH 0/2]
> m68k: Add Virtual M68k Machine"
> https://lore.kernel.org/linux-m68k/20210323221430.3735147-1-laurent@vivier.eu/

Thanks, I'll try that one.

I'll try and implement a few of the solutions Eric came up with for 
alpha, unless someone beats me to it (Andreas?).

Cheers,

     Michael


>
> Gr{oetje,eeting}s,
>
>                          Geert
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 18:31                             ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
@ 2021-06-16 20:00                               ` Linus Torvalds
  2021-06-16 20:37                                 ` Linus Torvalds
  2021-06-16 20:42                                 ` Eric W. Biederman
  2021-06-16 20:17                               ` Al Viro
  2021-06-21  2:01                               ` Michael Schmitz
  2 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-16 20:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 11:32 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Prevent security holes by recording when all of the registers are
> available so generic code changes do not result in security holes
> on alpha.

Please no, not this way. ldl/stc is extremely expensive on some alpha cpus.

I really think that TIF_ALLREGS_SAVED bit isn't worth it, except
perhaps for debugging.

And even for debugging, I think it would be both easier and cheaper to
just add a magic word to the entry stack instead.

              Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 18:31                             ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
  2021-06-16 20:00                               ` Linus Torvalds
@ 2021-06-16 20:17                               ` Al Viro
  2021-06-21  2:01                               ` Michael Schmitz
  2 siblings, 0 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 01:31:52PM -0500, Eric W. Biederman wrote:

> +.macro	SAVE_SWITCH_STACK
> +	DO_SWITCH_STACK
> +1:	ldl_l	$1, TI_FLAGS($8)
> +	bis	$1, _TIF_ALLREGS_SAVED, $1
> +	stl_c	$1, TI_FLAGS($8)
> +	beq	$1, 2f
> +.subsection 2
> +2:	br	1b
> +.previous
> +.endm

What the hell?  *IF* you are going to go that way, at least put it into
->status, not ->flag - those are thread-synchronous and do not require that
kind of masturbation.

> +.macro	RESTORE_SWITCH_STACK
> +1:	ldl_l	$1, TI_FLAGS($8)
> +	bic	$1, _TIF_ALLREGS_SAVED, $1
> +	stl_c	$1, TI_FLAGS($8)
> +	beq	$1, 2f
> +.subsection 2
> +2:	br	1b
> +.previous
> +	UNDO_SWITCH_STACK
> +.endm

Ditto.  What do you need that flag for, anyway?

> @@ -117,7 +117,13 @@ get_reg_addr(struct task_struct * task, unsigned long regno)
>  		zero = 0;
>  		addr = &zero;
>  	} else {
> -		addr = task_stack_page(task) + regoff[regno];
> +		int off = regoff[regno];
> +		if (WARN_ON_ONCE((off < PT_REG(r0)) &&
> +				!test_ti_thread_flag(task_thread_info(task),
> +						     TIF_ALLREGS_SAVED)))
> +			addr = &zero;
> +		else
> +			addr = task_stack_page(task) + off;

A sanity check in slow path, buggering the hell out of a fast path?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 18:32                             ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman
@ 2021-06-16 20:25                               ` Al Viro
  2021-06-16 20:28                                 ` Al Viro
  2021-06-16 20:47                                 ` Eric W. Biederman
  0 siblings, 2 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:

> -.macro	fork_like name
> +.macro	allregs name
>  	.align	4
>  	.globl	alpha_\name
>  	.ent	alpha_\name
> +	.cfi_startproc
>  alpha_\name:
>  	.prologue 0
> -	bsr	$1, do_switch_stack
> +	SAVE_SWITCH_STACK
>  	jsr	$26, sys_\name
> -	ldq	$26, 56($sp)
> -	lda	$sp, SWITCH_STACK_SIZE($sp)
> +	RESTORE_SWITCH_STACK

	No.  You've just added one hell of an overhead to fork(2),
for no reason whatsoever.  sys_fork() et.al. does *NOT* modify the
callee-saved registers; it's plain C.  So this change is complete
BS.

> +allregs exit
> +allregs exit_group

	Details, please - what exactly makes exit(2) different from
e.g. open(2)?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:25                               ` Al Viro
@ 2021-06-16 20:28                                 ` Al Viro
  2021-06-16 20:49                                   ` Eric W. Biederman
  2021-06-16 20:47                                 ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 08:25:35PM +0000, Al Viro wrote:
> On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:
> 
> > -.macro	fork_like name
> > +.macro	allregs name
> >  	.align	4
> >  	.globl	alpha_\name
> >  	.ent	alpha_\name
> > +	.cfi_startproc
> >  alpha_\name:
> >  	.prologue 0
> > -	bsr	$1, do_switch_stack
> > +	SAVE_SWITCH_STACK
> >  	jsr	$26, sys_\name
> > -	ldq	$26, 56($sp)
> > -	lda	$sp, SWITCH_STACK_SIZE($sp)
> > +	RESTORE_SWITCH_STACK
> 
> 	No.  You've just added one hell of an overhead to fork(2),
> for no reason whatsoever.  sys_fork() et.al. does *NOT* modify the
> callee-saved registers; it's plain C.  So this change is complete
> BS.
> 
> > +allregs exit
> > +allregs exit_group
> 
> 	Details, please - what exactly makes exit(2) different from
> e.g. open(2)?

Ah... PTRACE_EVENT_EXIT garbage, fortunately having no counterparts in case of
open(2)...  Still, WTF would you want to restore callee-saved registers for
in case of exit(2)?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 20:00                               ` Linus Torvalds
@ 2021-06-16 20:37                                 ` Linus Torvalds
  2021-06-16 20:57                                   ` Eric W. Biederman
  2021-06-16 20:42                                 ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-16 20:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 1:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And even for debugging, I think it would be both easier and cheaper to
> just add a magic word to the entry stack instead.

IOW, just add a

      unsigned long magic;

to "struct switch_stack", and then make the stack switch code push that value.

That would be cheap enough to be just unconditional, but you could
make it depend on a debug config option too, of course.

It helps if 'xyz' is some constant that is easyish to generate. It
might not be a constant - maybe it could be the address of that
'magic' field itself, so you'd just generate it with

    stq $r,($r)

and it would be equally easy to just validate at lookup for that WARN_ON_ONCE():

    WARN_ON_ONCE(switch_stack->magic != (unsigned long)&switch_stack->magic);

or whatever.

It's for debugging, not security. So it doesn't have to be some kind
of super-great magic number, just something easy to generate and check
(that isn't a common value like "0" that trivially exists on the stack
anyway).
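
A minimal userspace sketch of the self-address check described above. The
struct and helper names are hypothetical; only the 'magic' field, the
"store the field's own address" trick and the comparison come from the
text, and the real check would sit behind WARN_ON_ONCE() in the kernel:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct switch_stack; only the debug word
 * that matters for the check is modelled. */
struct switch_stack_sketch {
	unsigned long r9;	/* placeholder for the saved registers */
	unsigned long magic;	/* written when the frame is pushed */
};

/* Models "stq $r,($r)": the stored value is the field's own address. */
static void push_switch_stack(struct switch_stack_sketch *sw)
{
	sw->magic = (unsigned long)(uintptr_t)&sw->magic;
}

/* Models the lookup-time comparison. */
static bool switch_stack_valid(const struct switch_stack_sketch *sw)
{
	return sw->magic == (unsigned long)(uintptr_t)&sw->magic;
}
```

Because the value depends on where the frame lives, a frame that was never
pushed, or stale bytes copied from elsewhere on the stack, fails the check.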

             Linus


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 20:00                               ` Linus Torvalds
  2021-06-16 20:37                                 ` Linus Torvalds
@ 2021-06-16 20:42                                 ` Eric W. Biederman
  1 sibling, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 16, 2021 at 11:32 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Prevent security holes by recording when all of the registers are
>> available so generic code changes do not result in security holes
>> on alpha.
>
> Please no, not this way. ldl/stc is extremely expensive on some alpha cpus.
>
> I really think that TIF_ALLREGS_SAVED bit isn't worth it, except
> perhaps for debugging.
>
> And even for debugging, I think it would be both easier and cheaper to
> just add a magic word to the entry stack instead.

I think I can do something like that.

Looking at arch/alpha/include/asm/cache.h it looks like alpha had either
32-byte or 64-byte cachelines.  Which makes struct switch_stack a full 10 or 5
cachelines in size.  So pushing something extra might hit an extra
cacheline.

However it looks like struct pt_regs is 16 bytes short of a full cache
line so struct switch_stack isn't going to be cacheline aligned. Adding
an extra 8 bytes of magic number will hopefully be in the noise.
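
The arithmetic behind "10 or 5 cachelines" can be sketched as follows. The
slot count is an assumption (8 integer slots plus 32 floating point slots of
8 bytes each, which gives the 320 bytes the figures above imply), and the
helper assumes a cacheline-aligned start, which the paragraph above says
does not hold in practice:

```c
#include <stddef.h>

/* Assumed layout: 8 integer slots plus 32 floating point slots. */
#define SWITCH_STACK_SLOTS	(8 + 32)
#define SLOT_BYTES		8

static size_t switch_stack_bytes(void)
{
	return (size_t)SWITCH_STACK_SLOTS * SLOT_BYTES;
}

/* Lines touched by an object of 'bytes' bytes starting on a line boundary. */
static size_t cachelines_spanned(size_t bytes, size_t line_bytes)
{
	return (bytes + line_bytes - 1) / line_bytes;
}
```

With an aligned start, one extra 8-byte word would push the frame onto one
more line for either line size; with an unaligned start that depends on
where the frame happens to fall.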

If I can I would like to find something that is cheap enough that I can
always leave on.  Mostly because there is little enough testing that a
bug that allows anyone to stomp the kernel stack has existed for 17 years
without being noticed.

If you want it to be a debug option only I can certainly make that
happen.  I am still going "Eek! Arbitrary stack smash!" in my head.

Eric


* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:25                               ` Al Viro
  2021-06-16 20:28                                 ` Al Viro
@ 2021-06-16 20:47                                 ` Eric W. Biederman
  2021-06-16 20:55                                   ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:
>
>> -.macro	fork_like name
>> +.macro	allregs name
>>  	.align	4
>>  	.globl	alpha_\name
>>  	.ent	alpha_\name
>> +	.cfi_startproc
>>  alpha_\name:
>>  	.prologue 0
>> -	bsr	$1, do_switch_stack
>> +	SAVE_SWITCH_STACK
>>  	jsr	$26, sys_\name
>> -	ldq	$26, 56($sp)
>> -	lda	$sp, SWITCH_STACK_SIZE($sp)
>> +	RESTORE_SWITCH_STACK
>
> 	No.  You've just added one hell of an overhead to fork(2),
> for no reason whatsoever.  sys_fork() et.al. does *NOT* modify the
> callee-saved registers; it's plain C.  So this change is complete
> BS.

Fork already saves the registers, all I did was restore them.  Which
makes a debugger that modifies them in
PTRACE_EVENT_{FORK,VFORK,CLONE,VFORK_DONE} work.

>> +allregs exit
>> +allregs exit_group
>
> 	Details, please - what exactly makes exit(2) different from
> e.g. open(2)?

PTRACE_EVENT_EXIT.


Eric





* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:28                                 ` Al Viro
@ 2021-06-16 20:49                                   ` Eric W. Biederman
  2021-06-16 20:54                                     ` Al Viro
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Jun 16, 2021 at 08:25:35PM +0000, Al Viro wrote:
>> On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:
>> 
>> > -.macro	fork_like name
>> > +.macro	allregs name
>> >  	.align	4
>> >  	.globl	alpha_\name
>> >  	.ent	alpha_\name
>> > +	.cfi_startproc
>> >  alpha_\name:
>> >  	.prologue 0
>> > -	bsr	$1, do_switch_stack
>> > +	SAVE_SWITCH_STACK
>> >  	jsr	$26, sys_\name
>> > -	ldq	$26, 56($sp)
>> > -	lda	$sp, SWITCH_STACK_SIZE($sp)
>> > +	RESTORE_SWITCH_STACK
>> 
>> 	No.  You've just added one hell of an overhead to fork(2),
>> for no reason whatsoever.  sys_fork() et.al. does *NOT* modify the
>> callee-saved registers; it's plain C.  So this change is complete
>> BS.
>> 
>> > +allregs exit
>> > +allregs exit_group
>> 
>> 	Details, please - what exactly makes exit(2) different from
>> e.g. open(2)?
>
> Ah... PTRACE_EVENT_EXIT garbage, fortunately having no counterparts in case of
> open(2)...  Still, WTF would you want to restore callee-saved registers for
> in case of exit(2)?

Someone might want or try to read them in the case of exit.  Which
without some change will result in a read of other kernel stack content
on alpha.

Plus there are coredumps which definitely want to read everything.
Although admittedly that case no longer matters.

Eric



* Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads
  2021-06-15 19:36                     ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman
  2021-06-15 22:02                       ` Linus Torvalds
@ 2021-06-16 20:50                       ` Al Viro
  1 sibling, 0 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Linus Torvalds, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Tue, Jun 15, 2021 at 02:36:38PM -0500, Eric W. Biederman wrote:
> 
> While thinking about the information leaks fixed in 77f6ab8b7768
> ("don't dump the threads that had been already exiting when zapped.")
> I realized the problem is much more general than just coredumps and
> exit_mm.  We have io_uring threads, PTRACE_EVENT_EXEC and
> PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
> registers, but on some architectures has not saved them.

Wait a sec.  To have anything happen on PTRACE_EVENT_EXEC, you need the
fucker traced.  *IF* you want to go that way, at least make it conditional
upon the same thing.


* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:49                                   ` Eric W. Biederman
@ 2021-06-16 20:54                                     ` Al Viro
  0 siblings, 0 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 03:49:44PM -0500, Eric W. Biederman wrote:

> Someone might want or try to read them in the case of exit.  Which
> without some change will result in a read of other kernel stack content
> on alpha.

And someone might want a pony.  Again, why bother restoring those,
_especially_ in case of exit(2)?

> Plus there are coredumps which definitely want to read everything.

Huh?  In case of coredump we are going to have come through
$work_notifysig:
        mov     $sp, $16
	DO_SWITCH_STACK
	jsr     $26, do_work_pending
so they *do* have full pt_regs saved.  What's the problem?


* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:47                                 ` Eric W. Biederman
@ 2021-06-16 20:55                                   ` Al Viro
  0 siblings, 0 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 03:47:28PM -0500, Eric W. Biederman wrote:

> Fork already saves the registers, all I did was restore them.  Which
> makes a debugger that modifies them in
> PTRACE_EVENT_{FORK,VFORK,CLONE,VFORK_DONE} work.

... first time ever.  Wonderful and well worth the overhead.
</sarcasm>


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 20:37                                 ` Linus Torvalds
@ 2021-06-16 20:57                                   ` Eric W. Biederman
  2021-06-16 21:02                                     ` Al Viro
  2021-06-16 21:08                                     ` Linus Torvalds
  0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 16, 2021 at 1:00 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And even for debugging, I think it would be both easier and cheaper to
>> just add a magic word to the entry stack instead.
>
> IOW, just add a
>
>       unsigned long magic;
>
> to "struct switch_stack", and then make the stack switch code push that value.
>
> That would be cheap enough to be just unconditional, but you could
> make it depend on a debug config option too, of course.
>
> It helps if 'xyz' is some constant that is easyish to generate. It
> might not be a constant - maybe it could be the address of that
> 'magic' field itself, so you'd just generate it with
>
>     stq $r,($r)
>
> and it would be equally easy to just validate at lookup for that WARN_ON_ONCE():
>
>     WARN_ON_ONCE(switch_stack->magic != (unsigned long)&switch_stack->magic);
>
> or whatever.
>
> It's for debugging, not security. So it doesn't have to be some kind
> of super-great magic number, just something easy to generate and check
> (that isn't a common value like "0" that trivially exists on the stack
> anyway).

Fair enough.

I was thinking for a moment that do_sigreturn might have a problem with
that but restore_sigcontext makes it clear that struct switch_stack is
not exposed to userspace.

Do you know if struct switch_stack or pt_regs is ever exposed to
userspace?  They are both defined in arch/alpha/include/uapi/asm/ptrace.h
which makes me think userspace must see those definitions somewhere.

Eric


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 20:57                                   ` Eric W. Biederman
@ 2021-06-16 21:02                                     ` Al Viro
  2021-06-16 21:08                                     ` Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 21:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 03:57:19PM -0500, Eric W. Biederman wrote:
 
> Do you know if struct switch_stack or pt_regs is ever exposed to
> userspace?  They are both defined in arch/alpha/include/uapi/asm/ptrace.h
> which makes me think userspace must see those definitions somewhere.

	They are exposed, but why mess with those in the first
place?  thread_info->status is strictly thread-synchronous, so just
use it and be done with that...
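
A rough userspace sketch of why a thread-synchronous status word is cheaper
than a bit in the shared flags word. All names here are hypothetical, and
bit 6 is only borrowed from the TIF_ALLREGS_SAVED definition in the patch
under discussion. The assumption is that the flags word can be modified by
other threads, so it needs an atomic read-modify-write (the ldl_l/stl_c
loop on alpha), while the status word is only written by its owning thread
and can use plain loads and stores:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TIF_ALLREGS_SAVED_SKETCH	6	/* bit number in the shared flags word */
#define TS_ALLREGS_SAVED_SKETCH		0x0001u	/* hypothetical status bit */

struct thread_info_sketch {
	atomic_uint flags;	/* other threads may modify this: atomic RMW */
	unsigned int status;	/* owner-only writes: plain access is enough */
};

/* Shared word: an atomic OR, comparable to the ldl_l/stl_c retry loop. */
static void flags_set_allregs(struct thread_info_sketch *ti)
{
	atomic_fetch_or(&ti->flags, 1u << TIF_ALLREGS_SAVED_SKETCH);
}

/* Thread-synchronous word: ordinary load, modify, store. */
static void status_set_allregs(struct thread_info_sketch *ti)
{
	ti->status |= TS_ALLREGS_SAVED_SKETCH;
}

static void status_clear_allregs(struct thread_info_sketch *ti)
{
	ti->status &= ~TS_ALLREGS_SAVED_SKETCH;
}

static bool status_allregs_saved(const struct thread_info_sketch *ti)
{
	return (ti->status & TS_ALLREGS_SAVED_SKETCH) != 0;
}
```

A tracer reading the status word of a stopped tracee is assumed to be safe
here because the tracee cannot be modifying it at that point.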


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 20:57                                   ` Eric W. Biederman
  2021-06-16 21:02                                     ` Al Viro
@ 2021-06-16 21:08                                     ` Linus Torvalds
  1 sibling, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-16 21:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 1:57 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Do you know if struct switch_stack or pt_regs is ever exposed to
> userspace?  They are both defined in arch/alpha/include/uapi/asm/ptrace.h
> which makes me think userspace must see those definitions somewhere.

Yeah, that uapi location is a bit unfortunate. It means that user
space _could_ have seen it.

Which probably means that some user space uses it. Not for any kernel
interfaces (the alpha ptrace register offsets are actually sane, and
we have that "regoff[]" array to find them) - but I could see some odd
program having decided to use the kernel pt_regs and switch_stack
structures for their own reasons.

Annoying. Because we don't really expose it as-is in any way, afaik.
Only incidentally - and by mistake - in a uapi header file.

Maybe a flag in thread_info->status (or even a new 32-bit field
entirely in thread_info) is the way to go like Al says.

            Linus


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-16 18:31                             ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
  2021-06-16 20:00                               ` Linus Torvalds
  2021-06-16 20:17                               ` Al Viro
@ 2021-06-21  2:01                               ` Michael Schmitz
  2021-06-21  2:17                                 ` Linus Torvalds
  2021-06-21  2:27                                 ` Al Viro
  2 siblings, 2 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21  2:01 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Eric,

instrumenting get_reg on m68k and using a similar patch to yours to warn 
when unsaved registers are accessed on the switch stack, I get a hit 
from getegid and getegid32, just by running a simple ptrace on ls.

Going to whack those two moles now ...

Cheers,

     Michael


On 17/06/21 6:31 am, Eric W. Biederman wrote:
> While thinking about the information leaks fixed in 77f6ab8b7768
> ("don't dump the threads that had been already exiting when zapped.")
> I realized the problem is much more general than just coredumps and
> exit_mm.  We have io_uring threads, PTRACE_EVENT_FORK,
> PTRACE_EVENT_VFORK, PTRACE_EVENT_CLONE, PTRACE_EVENT_EXEC and
> PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
> registers, but on some architectures has not saved them so
> they can be modified.
>
> The function alpha_switch_to does something reasonable: it saves the
> floating point registers and the caller saved registers and switches
> to a different thread.  Any register the caller is not expected to
> save it does not save.
>
> Meanwhile the system call entry point on alpha also does something
> reasonable.  The system call entry point saves all but the caller
> saved integer registers and doesn't touch the floating point registers
> as the kernel code does not touch them.
>
> This is a nice happy fast path until the kernel wants to access the
> user space's registers through ptrace or similar.  As user space's
> caller saved registers may be saved at an unpredictable point in the
> kernel code's stack, the routine which may stop and make the userspace
> registers available must be wrapped by code that will first save a
> switch stack frame at the bottom of the call stack, call the code that
> may access those registers and then pop the switch stack frame.
>
> The practical problem with this code structure is that this results in
> a game of whack-a-mole wrapping different kernel system calls.  Losing
> the game of whack-a-mole results in a security hole where userspace can
> write arbitrary data to the kernel stack.
>
> In general it is not possible to prevent generic code from introducing a
> ptrace_stop or register access without knowing alpha's limitations, that
> is, where alpha does not make all of the registers available.
>
> Prevent security holes by recording when all of the registers are
> available so generic code changes do not result in security holes
> on alpha.
>
> Cc: stable@vger.kernel.org
> Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread")
> Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
> History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>   arch/alpha/include/asm/thread_info.h |  2 ++
>   arch/alpha/kernel/entry.S            | 38 ++++++++++++++++++++++------
>   arch/alpha/kernel/ptrace.c           | 13 ++++++++--
>   3 files changed, 43 insertions(+), 10 deletions(-)
>
> diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
> index 2592356e3215..41e5986ed9c8 100644
> --- a/arch/alpha/include/asm/thread_info.h
> +++ b/arch/alpha/include/asm/thread_info.h
> @@ -63,6 +63,7 @@ register struct thread_info *__current_thread_info __asm__("$8");
>   #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
>   #define TIF_SYSCALL_AUDIT	4	/* syscall audit active */
>   #define TIF_NOTIFY_SIGNAL	5	/* signal notifications exist */
> +#define TIF_ALLREGS_SAVED	6	/* both pt_regs and switch_stack saved */
>   #define TIF_DIE_IF_KERNEL	9	/* dik recursion lock */
>   #define TIF_MEMDIE		13	/* is terminating due to OOM killer */
>   #define TIF_POLLING_NRFLAG	14	/* idle is polling for TIF_NEED_RESCHED */
> @@ -73,6 +74,7 @@ register struct thread_info *__current_thread_info __asm__("$8");
>   #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
>   #define _TIF_SYSCALL_AUDIT	(1<<TIF_SYSCALL_AUDIT)
>   #define _TIF_NOTIFY_SIGNAL	(1<<TIF_NOTIFY_SIGNAL)
> +#define _TIF_ALLREGS_SAVED	(1<<TIF_ALLREGS_SAVED)
>   #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
>   
>   /* Work to do on interrupt/exception return.  */
> diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
> index e227f3a29a43..c1edf54dc035 100644
> --- a/arch/alpha/kernel/entry.S
> +++ b/arch/alpha/kernel/entry.S
> @@ -174,6 +174,28 @@
>   	.cfi_adjust_cfa_offset	-SWITCH_STACK_SIZE
>   .endm
>   
> +.macro	SAVE_SWITCH_STACK
> +	DO_SWITCH_STACK
> +1:	ldl_l	$1, TI_FLAGS($8)
> +	bis	$1, _TIF_ALLREGS_SAVED, $1
> +	stl_c	$1, TI_FLAGS($8)
> +	beq	$1, 2f
> +.subsection 2
> +2:	br	1b
> +.previous
> +.endm
> +
> +.macro	RESTORE_SWITCH_STACK
> +1:	ldl_l	$1, TI_FLAGS($8)
> +	bic	$1, _TIF_ALLREGS_SAVED, $1
> +	stl_c	$1, TI_FLAGS($8)
> +	beq	$1, 2f
> +.subsection 2
> +2:	br	1b
> +.previous
> +	UNDO_SWITCH_STACK
> +.endm
> +
>   /*
>    * Non-syscall kernel entry points.
>    */
> @@ -559,9 +581,9 @@ $work_resched:
>   
>   $work_notifysig:
>   	mov	$sp, $16
> -	DO_SWITCH_STACK
> +	SAVE_SWITCH_STACK
>   	jsr	$26, do_work_pending
> -	UNDO_SWITCH_STACK
> +	RESTORE_SWITCH_STACK
>   	br	restore_all
>   
>   /*
> @@ -572,9 +594,9 @@ $work_notifysig:
>   	.type	strace, @function
>   strace:
>   	/* set up signal stack, call syscall_trace */
> -	DO_SWITCH_STACK
> +	SAVE_SWITCH_STACK
>   	jsr	$26, syscall_trace_enter /* returns the syscall number */
> -	UNDO_SWITCH_STACK
> +	RESTORE_SWITCH_STACK
>   
>   	/* get the arguments back.. */
>   	ldq	$16, SP_OFF+24($sp)
> @@ -602,9 +624,9 @@ ret_from_straced:
>   $strace_success:
>   	stq	$0, 0($sp)		/* save return value */
>   
> -	DO_SWITCH_STACK
> +	SAVE_SWITCH_STACK
>   	jsr	$26, syscall_trace_leave
> -	UNDO_SWITCH_STACK
> +	RESTORE_SWITCH_STACK
>   	br	$31, ret_from_sys_call
>   
>   	.align	3
> @@ -618,13 +640,13 @@ $strace_error:
>   	stq	$0, 0($sp)
>   	stq	$1, 72($sp)	/* a3 for return */
>   
> -	DO_SWITCH_STACK
> +	SAVE_SWITCH_STACK
>   	mov	$18, $9		/* save old syscall number */
>   	mov	$19, $10	/* save old a3 */
>   	jsr	$26, syscall_trace_leave
>   	mov	$9, $18
>   	mov	$10, $19
> -	UNDO_SWITCH_STACK
> +	RESTORE_SWITCH_STACK
>   
>   	mov	$31, $26	/* tell "ret_from_sys_call" we can restart */
>   	br	ret_from_sys_call
> diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
> index 8c43212ae38e..41fb994f36dc 100644
> --- a/arch/alpha/kernel/ptrace.c
> +++ b/arch/alpha/kernel/ptrace.c
> @@ -117,7 +117,13 @@ get_reg_addr(struct task_struct * task, unsigned long regno)
>   		zero = 0;
>   		addr = &zero;
>   	} else {
> -		addr = task_stack_page(task) + regoff[regno];
> +		int off = regoff[regno];
> +		if (WARN_ON_ONCE((off < PT_REG(r0)) &&
> +				!test_ti_thread_flag(task_thread_info(task),
> +						     TIF_ALLREGS_SAVED)))
> +			addr = &zero;
> +		else
> +			addr = task_stack_page(task) + off;
>   	}
>   	return addr;
>   }
> @@ -145,13 +151,16 @@ get_reg(struct task_struct * task, unsigned long regno)
>   static int
>   put_reg(struct task_struct *task, unsigned long regno, unsigned long data)
>   {
> +	unsigned long *addr;
>   	if (regno == 63) {
>   		task_thread_info(task)->ieee_state
>   		  = ((task_thread_info(task)->ieee_state & ~IEEE_SW_MASK)
>   		     | (data & IEEE_SW_MASK));
>   		data = (data & FPCR_DYN_MASK) | ieee_swcr_to_fpcr(data);
>   	}
> -	*get_reg_addr(task, regno) = data;
> +	addr = get_reg_addr(task, regno);
> +	if (addr != &zero)
> +		*addr = data;
>   	return 0;
>   }
>   


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  2:01                               ` Michael Schmitz
@ 2021-06-21  2:17                                 ` Linus Torvalds
  2021-06-21  3:18                                   ` Michael Schmitz
  2021-06-21  2:27                                 ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21  2:17 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Al Viro, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Sun, Jun 20, 2021 at 7:01 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> instrumenting get_reg on m68k and using a similar patch to yours to warn
> when unsaved registers are accessed on the switch stack, I get a hit
> from getegid and getegid32, just by running a simple ptrace on ls.
>
> Going to whack those two moles now ...

I don't see what's going on. Those system calls don't use the register
state, afaik. What's the call chain, exactly?

           Linus


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  2:01                               ` Michael Schmitz
  2021-06-21  2:17                                 ` Linus Torvalds
@ 2021-06-21  2:27                                 ` Al Viro
  2021-06-21  3:36                                   ` Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21  2:27 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 02:01:18PM +1200, Michael Schmitz wrote:
> Hi Eric,
> 
> instrumenting get_reg on m68k and using a similar patch to yours to warn
> when unsaved registers are accessed on the switch stack, I get a hit from
> getegid and getegid32, just by running a simple ptrace on ls.
> 
> Going to whack those two moles now ...

Explain, please.  get_reg() is called by tracer; whose state are you checking?
Because you are *not* accessing the switch stack of the caller of get_reg().
And tracee should be in something like syscall_trace() or do_notify_resume();
both have SAVE_SWITCH_STACK done by the glue...


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  2:17                                 ` Linus Torvalds
@ 2021-06-21  3:18                                   ` Michael Schmitz
  2021-06-21  3:37                                     ` Linus Torvalds
  2021-06-21  3:44                                     ` Al Viro
  0 siblings, 2 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21  3:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Al Viro, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Linus,

Am 21.06.2021 um 14:17 schrieb Linus Torvalds:
> On Sun, Jun 20, 2021 at 7:01 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>>
>> instrumenting get_reg on m68k and using a similar patch to yours to warn
>> when unsaved registers are accessed on the switch stack, I get a hit
>> from getegid and getegid32, just by running a simple ptrace on ls.
>>
>> Going to whack those two moles now ...
>
> I don't see what's going on. Those system calls don't use the register
> state, afaik. What's the call chain, exactly?

This is what I get from WARN_ONCE:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 1177 at arch/m68k/kernel/ptrace.c:91 get_reg+0x90/0xb8
Modules linked in:
CPU: 0 PID: 1177 Comm: strace Not tainted 5.13.0-rc1-atari-fpuemu-exitfix+ #1146
Stack from 014b7f04:
         014b7f04 00336401 00336401 000278f0 0032c015 0000005b 00000005 0002795a
         0032c015 0000005b 0000338c 00000009 00000000 00000000 ffffffe4 00000005
         00000003 00000014 00000003 00000014 efc2b90c 0000338c 0032c015 0000005b
         00000009 00000000 efc2b908 00912540 efc2b908 000034cc 00912540 00000005
         00000000 efc2b908 00000003 00912540 8000110c c010b0a4 efc2b90c 0002d1d8
         00912540 00000003 00000014 efc2b908 0000049a 00000014 efc2b908 800acaa8
Call Trace: [<000278f0>] __warn+0x9e/0xb4
  [<0002795a>] warn_slowpath_fmt+0x54/0x62
  [<0000338c>] get_reg+0x90/0xb8
  [<0000338c>] get_reg+0x90/0xb8
  [<000034cc>] arch_ptrace+0x7e/0x250
  [<0002d1d8>] sys_ptrace+0x232/0x2f8
  [<00002ab6>] syscall+0x8/0xc
  [<0000c00b>] lower+0x7/0x20

---[ end trace ee4be53b94695793 ]---

Syscall numbers are actually 90 and 192 - sys_old_mmap and sys_mmap2 on 
m68k. Used the calculator on my Ubuntu desktop, that appears to be a 
little confused about hex to decimal conversions.

I hope that makes more sense?

Cheers,

	Michael

>
>            Linus
>


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  2:27                                 ` Al Viro
@ 2021-06-21  3:36                                   ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21  3:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Al,


Am 21.06.2021 um 14:27 schrieb Al Viro:
> On Mon, Jun 21, 2021 at 02:01:18PM +1200, Michael Schmitz wrote:
>> Hi Eric,
>>
>> instrumenting get_reg on m68k and using a similar patch to yours to warn
>> when unsaved registers are accessed on the switch stack, I get a hit from
>> getegid and getegid32, just by running a simple ptrace on ls.
>>
>> Going to whack those two moles now ...
>
> Explain, please.  get_reg() is called by tracer; whose state are you checking?

The check is only triggered when syscall tracing (I set a flag on trace 
entry, and clear that on trace exit)... From the WARN_ONCE stack dump, 
it appears that I get the warning from inside the syscall, not 
syscall_trace().

> Because you are *not* accessing the switch stack of the caller of get_reg().
> And tracee should be in something like syscall_trace() or do_notify_resume();
> both have SAVE_SWITCH_STACK done by the glue...

And that's where my problem may be - I stupidly forgot to set the 'all 
registers saved' flag before calling syscall_trace() ...

I'll fix that and try again. Sorry for the noise!

Cheers,

	Michael



>


* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  3:18                                   ` Michael Schmitz
@ 2021-06-21  3:37                                     ` Linus Torvalds
  2021-06-21  4:08                                       ` Michael Schmitz
  2021-06-21  3:44                                     ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21  3:37 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Al Viro, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Sun, Jun 20, 2021 at 8:18 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> I hope that makes more sense?

So the problem is in your debug patch: you don't set that
TIS_SWITCH_STACK in nearly enough places.

In this particular example, I think it's that you don't set it in
do_trace_exit, so when you strace the process, the system call exit -
which is where the return value will be picked up - gets that warning.

You did set TIS_SWITCH_STACK on trace_entry, but then it's cleared
again during the system call, and not set at the trace_exit path.
Oddly, your debug patch also _clears_ it on the exit path, but it
doesn't set it when do_trace_exit does the SAVE_SWITCH_STACK.

You oddly also set it for __sys_exit, but not all the other special
system calls that also do that SAVE_SWITCH_STACK.

Really, pretty much every single case of SAVE_SWITCH_STACK would need
to set it. Not just do_trace_enter/exit
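
A minimal user-space sketch of why one missed site is enough to trip the
warning. Everything here is a hypothetical model: the struct, the
MODEL_TIS_SWITCH_STACK bit and the model_* helpers stand in for
thread_info->status and the entry glue, and are not taken from the m68k
or alpha sources or from the debug patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for thread_info->status; not the real layout. */
struct model_thread {
	unsigned int status;
};

/* Hypothetical bit: "a switch_stack is on the kernel stack right now". */
#define MODEL_TIS_SWITCH_STACK 0x1u

/* What an annotated SAVE_SWITCH_STACK site does besides pushing registers. */
static void model_save_switch_stack(struct model_thread *t)
{
	t->status |= MODEL_TIS_SWITCH_STACK;
}

/* What the matching RESTORE_SWITCH_STACK site does. */
static void model_restore_switch_stack(struct model_thread *t)
{
	t->status &= ~MODEL_TIS_SWITCH_STACK;
}

/* What a get_reg()-style debug check would look at. */
static bool model_regs_complete(const struct model_thread *t)
{
	return (t->status & MODEL_TIS_SWITCH_STACK) != 0;
}

/*
 * Walk one traced system call. The entry path is always annotated; the
 * exit path is annotated only if exit_annotated is true. Returns what the
 * tracer would observe at the syscall-exit stop.
 */
static bool model_trace_exit_sees_full_regs(bool exit_annotated)
{
	struct model_thread t = { 0 };
	bool seen;

	/* do_trace_entry: save, stop for the tracer, restore (flag cleared) */
	model_save_switch_stack(&t);
	model_restore_switch_stack(&t);

	/* ... the system call body runs with only the partial frame ... */

	/* do_trace_exit: registers are pushed either way, but the flag is
	 * only set if this site was annotated too */
	if (exit_annotated)
		model_save_switch_stack(&t);
	seen = model_regs_complete(&t);
	if (exit_annotated)
		model_restore_switch_stack(&t);
	return seen;
}
```

With the exit site left unannotated the model reports incomplete
registers even though the real glue would have saved them, which is the
false warning described above.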

It's why I didn't like Eric's debug patch either. It's quite expensive
to do, partly because you look up that curptr thing. All very nasty.

It would be *much* better to make the flag be part of the stack frame,
but sadly at least on alpha we had exported the format of that stack
frame to user space.

Anyway, I think these debug patches are not just expensive but the
m68k one most definitely is also very incomplete.

             Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  3:18                                   ` Michael Schmitz
  2021-06-21  3:37                                     ` Linus Torvalds
@ 2021-06-21  3:44                                     ` Al Viro
  2021-06-21  5:31                                       ` Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21  3:44 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 03:18:35PM +1200, Michael Schmitz wrote:

> This is what I get from WARN_ONCE:
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1177 at arch/m68k/kernel/ptrace.c:91 get_reg+0x90/0xb8
> Modules linked in:
> CPU: 0 PID: 1177 Comm: strace Not tainted 5.13.0-rc1-atari-fpuemu-exitfix+ #1146
> Stack from 014b7f04:
>         014b7f04 00336401 00336401 000278f0 0032c015 0000005b 00000005 0002795a
>         0032c015 0000005b 0000338c 00000009 00000000 00000000 ffffffe4 00000005
>         00000003 00000014 00000003 00000014 efc2b90c 0000338c 0032c015 0000005b
>         00000009 00000000 efc2b908 00912540 efc2b908 000034cc 00912540 00000005
>         00000000 efc2b908 00000003 00912540 8000110c c010b0a4 efc2b90c 0002d1d8
>         00912540 00000003 00000014 efc2b908 0000049a 00000014 efc2b908 800acaa8
> Call Trace: [<000278f0>] __warn+0x9e/0xb4
>  [<0002795a>] warn_slowpath_fmt+0x54/0x62
>  [<0000338c>] get_reg+0x90/0xb8
>  [<0000338c>] get_reg+0x90/0xb8
>  [<000034cc>] arch_ptrace+0x7e/0x250
>  [<0002d1d8>] sys_ptrace+0x232/0x2f8
>  [<00002ab6>] syscall+0x8/0xc
>  [<0000c00b>] lower+0x7/0x20
> 
> ---[ end trace ee4be53b94695793 ]---
> 
> Syscall numbers are actually 90 and 192 - sys_old_mmap and sys_mmap2 on
> m68k. Used the calculator on my Ubuntu desktop, that appears to be a little
> confused about hex to decimal conversions.
> 
> I hope that makes more sense?

Not really; what is the condition you are checking?  The interesting trace
is not that with get_reg() - it's that of the process being traced.  You
are not accessing the stack of caller of ptrace(2) here, so you want to
know that SAVE_SWITCH_STACK had been done by the tracee, not tracer.

And if that had been strace ls, you have TIF_SYSCALL_TRACE set for ls, so
	* ls hits system_call
	* notices TIF_SYSCALL_TRACE and goes to do_trace_entry
	* does SAVE_SWITCH_STACK there
	* calls syscall_trace(), which calls ptrace_notify()
	* ptrace_notify() calls ptrace_do_notify(), which calls ptrace_stop()
	* ptrace_stop() arranges for tracer to be woken up and gives CPU up,
with TASK_TRACED as process state.

That's the callchain in ls, and switch_stack accessed by get_reg() from
strace is the one on ls(1) stack created by SAVE_SWITCH_STACK.
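
A rough user-space model of that callchain, assuming nothing beyond the
steps listed above. The struct and the model_* names are hypothetical;
the point is only that the state the tracer depends on lives in the
tracee, and is set up by the tracee before it stops.

```c
#include <assert.h>
#include <stdbool.h>

enum model_state { MODEL_RUNNING, MODEL_TRACED };

/* Hypothetical, heavily reduced task; not the kernel's task_struct. */
struct model_task {
	bool syscall_trace;       /* stands in for TIF_SYSCALL_TRACE */
	bool switch_stack_saved;  /* SAVE_SWITCH_STACK done on this task's stack */
	enum model_state state;
};

/* ptrace_stop(): wake the tracer and give up the CPU as TASK_TRACED. */
static void model_ptrace_stop(struct model_task *t)
{
	t->state = MODEL_TRACED;
}

/* syscall_trace() -> ptrace_notify() -> ptrace_do_notify() -> ptrace_stop() */
static void model_syscall_trace(struct model_task *t)
{
	model_ptrace_stop(t);
}

/* system_call entry as run by the tracee. */
static void model_system_call(struct model_task *t)
{
	if (!t->syscall_trace)
		return;                   /* fast path: partial frame only */
	t->switch_stack_saved = true;     /* do_trace_entry: SAVE_SWITCH_STACK */
	model_syscall_trace(t);
}

/* Tracer side: get_reg() reads the tracee's stack, never the tracer's. */
static bool model_tracer_can_read_all_regs(const struct model_task *tracee)
{
	return tracee->state == MODEL_TRACED && tracee->switch_stack_saved;
}
```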

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  3:37                                     ` Linus Torvalds
@ 2021-06-21  4:08                                       ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21  4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Al Viro, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Linus,

I realized that the patch is still incomplete when answering Al...

On 21.06.2021 at 15:37, Linus Torvalds wrote:
> On Sun, Jun 20, 2021 at 8:18 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>>
>> I hope that makes more sense?
>
> So the problem is in your debug patch: you don't set that
> TIS_SWITCH_STACK in nearly enough places.
>
> In this particular example, I think it's that you don't set it in
> do_trace_exit, so when you strace the process, the system call exit -
> which is where the return value will be picked up - gets that warning.
>
> You did set TIS_SWITCH_STACK on trace_entry, but then it's cleared
> again during the system call, and not set at the trace_exit path.
> Oddly, your debug patch also _clears_ it on the exit path, but it
> doesn't set it when do_trace_exit does the SAVE_SWITCH_STACK.
>
> You oddly also set it for __sys_exit, but not all the other special
> system calls that also do that SAVE_SWITCH_STACK.

That's the one I used to test whether my debug patch had any ill side 
effects (i.e. smashing the stack) late yesterday. Forgot to add that to 
the other cases.

>
> Really, pretty much every single case of SAVE_SWITCH_STACK would need
> to set it. Not just do_trace_enter/exit

Yes - done that now and the warning is gone.

> It's why I didn't like Eric's debug patch either. It's quite expensive
> to do, partly because you look up that curptr thing. All very nasty.

I need to talk to Geert and Andreas to find where register a1 is 
preserved, but if I have to reload a1 all the time, this won't be useful 
except for debugging.

> It would be *much* better to make the flag be part of the stack frame,
> but sadly at least on alpha we had exported the format of that stack
> frame to user space.

Same on m68k, but can we push a flag _after_ the switch stack?

> Anyway, I think these debug patches are not just expensive but the
> m68k one most definitely is also very incomplete.

Yes, I've seen that in the meantime. Need to triple check my work next 
time.

Sorry for the extra noise!

Cheers,

	Michael

>
>              Linus
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21  3:44                                     ` Al Viro
@ 2021-06-21  5:31                                       ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21  5:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Al,

On 21.06.2021 at 15:44, Al Viro wrote:
> On Mon, Jun 21, 2021 at 03:18:35PM +1200, Michael Schmitz wrote:
>
>> This is what I get from WARN_ONCE:
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1177 at arch/m68k/kernel/ptrace.c:91 get_reg+0x90/0xb8
>> Modules linked in:
>> CPU: 0 PID: 1177 Comm: strace Not tainted 5.13.0-rc1-atari-fpuemu-exitfix+ #1146
>> Stack from 014b7f04:
>>         014b7f04 00336401 00336401 000278f0 0032c015 0000005b 00000005 0002795a
>>         0032c015 0000005b 0000338c 00000009 00000000 00000000 ffffffe4 00000005
>>         00000003 00000014 00000003 00000014 efc2b90c 0000338c 0032c015 0000005b
>>         00000009 00000000 efc2b908 00912540 efc2b908 000034cc 00912540 00000005
>>         00000000 efc2b908 00000003 00912540 8000110c c010b0a4 efc2b90c 0002d1d8
>>         00912540 00000003 00000014 efc2b908 0000049a 00000014 efc2b908 800acaa8
>> Call Trace: [<000278f0>] __warn+0x9e/0xb4
>>  [<0002795a>] warn_slowpath_fmt+0x54/0x62
>>  [<0000338c>] get_reg+0x90/0xb8
>>  [<0000338c>] get_reg+0x90/0xb8
>>  [<000034cc>] arch_ptrace+0x7e/0x250
>>  [<0002d1d8>] sys_ptrace+0x232/0x2f8
>>  [<00002ab6>] syscall+0x8/0xc
>>  [<0000c00b>] lower+0x7/0x20
>>
>> ---[ end trace ee4be53b94695793 ]---
>>
>> Syscall numbers are actually 90 and 192 - sys_old_mmap and sys_mmap2 on
>> m68k. Used the calculator on my Ubuntu desktop, that appears to be a little
>> confused about hex to decimal conversions.
>>
>> I hope that makes more sense?
>
> Not really; what is the condition you are checking?  The interesting trace

The check in get_reg() is:


            if (WARN_ON_ONCE((off < PT_REG(d1)) &&
               test_ti_thread_status(task_thread_info(task),TIS_TRACING)
                    && !test_ti_thread_status(task_thread_info(task),
                                         TIS_ALLREGS_SAVED))) {
                    unsigned long *addr_d0;
                    addr_d0 = (unsigned long *)(task->thread.esp0 + regoff[16]);
                    pr_err("get_reg with incomplete stack, regno %d offs %d orig_d0 %lx\n",
                           regno, off, *addr_d0);
                    return 0;
            }
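
The condition can be restated as a stand-alone predicate. This is a
sketch under assumptions, not the patch itself: the two structs are
reduced models of pt_regs and switch_stack, MODEL_PT_REG/MODEL_SW_REG
imitate offset macros where switch_stack registers get negative offsets
(so "off < PT_REG(d1)" means "lives in switch_stack"), and the flag bit
values are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical models of the two register frames. */
struct model_switch_stack { long d6, d7, a3, a4, a5, a6; };
struct model_pt_regs { long d1, d2, d3, d4, d5, a0, a1, a2, d0, orig_d0; };

/* Offsets relative to the start of pt_regs; switch_stack sits below it. */
#define MODEL_PT_REG(r) ((long)offsetof(struct model_pt_regs, r))
#define MODEL_SW_REG(r) ((long)offsetof(struct model_switch_stack, r) - \
			 (long)sizeof(struct model_switch_stack))

/* Hypothetical bit values for the two debug flags. */
#define MODEL_TIS_TRACING       0x1u
#define MODEL_TIS_ALLREGS_SAVED 0x2u

/*
 * True when a read at offset 'off' would touch switch_stack on a traced
 * task whose status does not claim that switch_stack was saved.
 */
static bool model_get_reg_would_warn(long off, unsigned int status)
{
	return off < MODEL_PT_REG(d1) &&
	       (status & MODEL_TIS_TRACING) != 0 &&
	       (status & MODEL_TIS_ALLREGS_SAVED) == 0;
}
```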


> is not that with get_reg() - it's that of the process being traced.  You
> are not accessing the stack of caller of ptrace(2) here, so you want to
> know that SAVE_SWITCH_STACK had been done by the tracee, not tracer.
>
> And if that had been strace ls, you have TIF_SYSCALL_TRACE set for ls, so
> 	* ls hits system_call
> 	* notices TIF_SYSCALL_TRACE and goes to do_trace_entry
> 	* does SAVE_SWITCH_STACK there

... and sets both the new TIS_TRACING and TIS_ALLREGS_SAVED flags in the 
thread_info->status field (now that I've corrected my patch).

> 	* calls syscall_trace(), which calls ptrace_notify()
> 	* ptrace_notify() calls ptrace_do_notify(), which calls ptrace_stop()
> 	* ptrace_stop() arranges for tracer to be woken up and gives CPU up,
> with TASK_TRACED as process state.

Thanks for explaining! So in order to get a trace for the process being 
traced, I would have to check the TIS_ALLREGS_SAVED flag in ptrace_stop()?

> That's the callchain in ls, and switch_stack accessed by get_reg() from
> strace is the one on ls(1) stack created by SAVE_SWITCH_STACK.

So testing for TIS_ALLREGS_SAVED in get_reg() (called by the tracer, but 
with the tracee's task struct passed to arch_ptrace()) does check that 
SAVE_SWITCH_STACK was done before the syscall in the tracee, right?

Anyway, I'd missed setting the flags for some crucial SAVE_SWITCH_STACK 
operations in my woefully incomplete patch. With that corrected, there's 
no more warning from mmap. I'll try with a more recent version of strace 
and gdb once I've updated my test image.

Cheers,

	Michael

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-15 21:58                     ` Linus Torvalds
  2021-06-16 15:06                       ` Eric W. Biederman
@ 2021-06-21 13:54                       ` Al Viro
  2021-06-21 14:16                         ` Al Viro
                                           ` (2 more replies)
  1 sibling, 3 replies; 119+ messages in thread
From: Al Viro @ 2021-06-21 13:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:

> And I think our horrible "kernel threads return to user space when
> done" is absolutely horrifically nasty. Maybe of the clever sort, but
> mostly of the historical horror sort.

How would you prefer to handle that, then?  Separate magical path from
kernel_execve() to switch to userland?  We used to have something of
that sort, and that had been a real horror...

As it is, it's "kernel thread is spawned at the point similar to
ret_from_fork(), runs the payload (which almost never returns) and
then proceeds out to userland, same way fork(2) would've done."
That way kernel_execve() doesn't have to do anything magical.
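
A toy model of that flow, for illustration only. All names are
hypothetical and the "registers" are a single boolean; it just shows a
payload that returns after a successful kernel_execve() falling through
to the common return-to-userland path, with no special case in
kernel_execve() itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced task state. */
struct model_task {
	bool has_user_regs;  /* a user register frame has been set up */
	bool in_userland;
};

typedef int (*model_payload_fn)(struct model_task *);

/* Stand-in for kernel_execve(): sets up the user frame, nothing magical. */
static int model_kernel_execve(struct model_task *t)
{
	t->has_user_regs = true;
	return 0;
}

/* Payload of the 'init' kind: execs and then returns. */
static int model_init_payload(struct model_task *t)
{
	return model_kernel_execve(t);
}

/* Payload that returns without exec; real kernel threads must not do this. */
static int model_plain_payload(struct model_task *t)
{
	(void)t;
	return 0;
}

/*
 * Spawn point similar to ret_from_fork(): run the payload, then leave
 * through the same exit as fork(2) would. Returns true if the thread
 * reached userland; the model refuses when there is no user frame.
 */
static bool model_ret_from_kernel_thread(struct model_task *t,
					 model_payload_fn fn)
{
	if (fn(t) != 0 || !t->has_user_regs)
		return false;
	t->in_userland = true;
	return true;
}
```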

Al, digging through the old notes and current call graph...

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 13:54                       ` Al Viro
@ 2021-06-21 14:16                         ` Al Viro
  2021-06-21 16:50                           ` Eric W. Biederman
  2021-06-21 15:38                         ` Linus Torvalds
  2021-06-21 18:59                         ` Al Viro
  2 siblings, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 14:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> 
> > And I think our horrible "kernel threads return to user space when
> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > mostly of the historical horror sort.
> 
> How would you prefer to handle that, then?  Separate magical path from
> kernel_execve() to switch to userland?  We used to have something of
> that sort, and that had been a real horror...
> 
> As it is, it's "kernel thread is spawned at the point similar to
> ret_from_fork(), runs the payload (which almost never returns) and
> then proceeds out to userland, same way fork(2) would've done."
> That way kernel_execve() doesn't have to do anything magical.
> 
> Al, digging through the old notes and current call graph...

	FWIW, the major assumption back then had been that get_signal(),
signal_delivered() and all associated machinery (including coredumps)
runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.

	And "has complete registers on stack" is only a part of that;
there was other fun stuff in the area ;-/  Do we want coredumps for
those, and if we do, will the de_thread stuff work there?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 13:54                       ` Al Viro
  2021-06-21 14:16                         ` Al Viro
@ 2021-06-21 15:38                         ` Linus Torvalds
  2021-06-21 18:59                         ` Al Viro
  2 siblings, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 15:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 6:55 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>
> > And I think our horrible "kernel threads return to user space when
> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > mostly of the historical horror sort.
>
> How would you prefer to handle that, then?  Separate magical path from
> kernel_execve() to switch to userland?  We used to have something of
> that sort, and that had been a real horror...

Hmm. Maybe the alternatives would all be worse. The current thing is
clever, and shares the return path with the normal case. It's just
also a bit surprising, in that a kernel thread normally must not
return - with the magical exception of "if it had done a
kernel_execve() at some point, then returning is magically the way you
actually start user mode".

So it all feels very special, and there's not even a comment about it.

I think we only have two users of that thing (the very first 'init',
and user-mode-helper), so I guess it doesn't really matter.

            Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 14:16                         ` Al Viro
@ 2021-06-21 16:50                           ` Eric W. Biederman
  2021-06-21 23:05                             ` Al Viro
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-21 16:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
>> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>> 
>> > And I think our horrible "kernel threads return to user space when
>> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
>> > mostly of the historical horror sort.
>> 
>> How would you prefer to handle that, then?  Separate magical path from
>> kernel_execve() to switch to userland?  We used to have something of
>> that sort, and that had been a real horror...
>> 
>> As it is, it's "kernel thread is spawned at the point similar to
>> ret_from_fork(), runs the payload (which almost never returns) and
>> then proceeds out to userland, same way fork(2) would've done."
>> That way kernel_execve() doesn't have to do anything magical.
>> 
>> Al, digging through the old notes and current call graph...
>
> 	FWIW, the major assumption back then had been that get_signal(),
> signal_delivered() and all associated machinery (including coredumps)
> runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
>
> 	And "has complete registers on stack" is only a part of that;
> there was other fun stuff in the area ;-/  Do we want coredumps for
> those, and if we do, will the de_thread stuff work there?

Do we want coredumps from processes that use io_uring?  Yes.
Exactly what we want from io_uring threads is less clear.  We can't
really give much that is meaningful beyond the thread ids of the
io_uring threads.

What problems are you seeing beyond the missing registers on the
stack for kernel threads?

I don't immediately see the connection between coredumps and de_thread.

The function de_thread arranges for fatal_signal_pending() to be true,
and that should work just fine for io_uring threads.  The io_uring
threads process the fatal signal with get_signal and then proceed to
exit, eventually calling do_exit.

Eric






^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 13:54                       ` Al Viro
  2021-06-21 14:16                         ` Al Viro
  2021-06-21 15:38                         ` Linus Torvalds
@ 2021-06-21 18:59                         ` Al Viro
  2021-06-21 19:22                           ` Linus Torvalds
  2021-06-21 19:24                           ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro
  2 siblings, 2 replies; 119+ messages in thread
From: Al Viro @ 2021-06-21 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> 
> > And I think our horrible "kernel threads return to user space when
> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > mostly of the historical horror sort.
> 
> How would you prefer to handle that, then?  Separate magical path from
> kernel_execve() to switch to userland?  We used to have something of
> that sort, and that had been a real horror...
> 
> As it is, it's "kernel thread is spawned at the point similar to
> ret_from_fork(), runs the payload (which almost never returns) and
> then proceeds out to userland, same way fork(2) would've done."
> That way kernel_execve() doesn't have to do anything magical.
> 
> Al, digging through the old notes and current call graph...

	There's a large mess around do_exit() - we have a bunch of
callers all over arch/*; if nothing else, I very much doubt that we
really want to let a tracer play with a thread in the middle of
die_if_kernel() or similar.

We sure as hell do not want to arrange for anything on the kernel
stack in such situations, no matter what's done in exit(2)...

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 18:59                         ` Al Viro
@ 2021-06-21 19:22                           ` Linus Torvalds
  2021-06-21 19:45                             ` Al Viro
  2021-06-21 20:03                             ` Eric W. Biederman
  2021-06-21 19:24                           ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro
  1 sibling, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 19:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 11:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         There's a large mess around do_exit() - we have a bunch of
> callers all over arch/*; if nothing else, I very much doubt that really
> want to let tracer play with a thread in the middle of die_if_kernel()
> or similar.

Right you are.

I'm really beginning to hate ptrace_{event,notify}() and those
PTRACE_EVENT_xyz things.

I don't even know what uses them, honestly. How very annoying.

I guess it's easy enough (famous last words) to move the
ptrace_event() call out of do_exit() and into the actual
exit/exit_group system calls, and the signal handling path. The paths
that actually have proper pt_regs.

Looks like sys_exit() and do_group_exit() would be the two places to
do it (do_group_exit() would handle the signal case and
sys_exit_group()).
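
A user-space sketch of that placement, under the assumption that it
would look roughly like this; it is not a patch. The model_* functions
and the event counter are hypothetical. Callers that reach do_exit()
directly, such as die_if_kernel(), no longer produce a tracer stop in
this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced task state. */
struct model_task {
	bool ptraced;
	int exit_events;  /* number of PTRACE_EVENT_EXIT-style stops seen */
	bool exited;
};

static void model_ptrace_event_exit(struct model_task *t)
{
	if (t->ptraced)
		t->exit_events++;
}

/* do_exit() with the ptrace_event() call moved out of it. */
static void model_do_exit(struct model_task *t)
{
	t->exited = true;
}

/* Covers fatal signals and exit_group(2): full pt_regs are available. */
static void model_do_group_exit(struct model_task *t)
{
	model_ptrace_event_exit(t);
	model_do_exit(t);
}

/* exit(2): also entered with proper pt_regs. */
static void model_sys_exit(struct model_task *t)
{
	model_ptrace_event_exit(t);
	model_do_exit(t);
}

static void model_sys_exit_group(struct model_task *t)
{
	model_do_group_exit(t);
}

/* Arch code killing a task from a not-from-userland context. */
static void model_die_if_kernel(struct model_task *t)
{
	model_do_exit(t);
}
```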

               Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 18:59                         ` Al Viro
  2021-06-21 19:22                           ` Linus Torvalds
@ 2021-06-21 19:24                           ` Al Viro
  2021-06-21 23:24                             ` Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 06:59:01PM +0000, Al Viro wrote:
> On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> > On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> > 
> > > And I think our horrible "kernel threads return to user space when
> > > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > > mostly of the historical horror sort.
> > 
> > How would you prefer to handle that, then?  Separate magical path from
> > kernel_execve() to switch to userland?  We used to have something of
> > that sort, and that had been a real horror...
> > 
> > As it is, it's "kernel thread is spawned at the point similar to
> > ret_from_fork(), runs the payload (which almost never returns) and
> > then proceeds out to userland, same way fork(2) would've done."
> > That way kernel_execve() doesn't have to do anything magical.
> > 
> > Al, digging through the old notes and current call graph...
> 
> 	There's a large mess around do_exit() - we have a bunch of
> callers all over arch/*; if nothing else, I very much doubt that really
> want to let tracer play with a thread in the middle of die_if_kernel()
> or similar.
> 
> We sure as hell do not want to arrange for anything on the kernel
> stack in such situations, no matter what's done in exit(2)...

FWIW, on alpha it's die_if_kernel(), do_entUna() and do_page_fault(),
all in not-from-userland cases.  On m68k - die_if_kernel(), do_page_fault()
(both for non-from-userland cases) and something really odd - fpsp040_die().
Exception handling for floating point stuff on 68040?  Looks like it has
an open-coded copy_to_user()/copy_from_user(), with faults doing hard
do_exit(SIGSEGV) instead of raising a signal and trying to do something
sane...

I really don't want to try and figure out how painful it would be to
teach that code how to deal with faults - _testing_ anything in that
area sure as hell will be.  IIRC, details of recovery from FPU exceptions
on 68040 in the manual left the impression of a minefield...

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 19:22                           ` Linus Torvalds
@ 2021-06-21 19:45                             ` Al Viro
  2021-06-21 23:14                               ` Linus Torvalds
  2021-06-21 20:03                             ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook,
	Tetsuo Handa

On Mon, Jun 21, 2021 at 12:22:06PM -0700, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 11:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         There's a large mess around do_exit() - we have a bunch of
> > callers all over arch/*; if nothing else, I very much doubt that really
> > want to let tracer play with a thread in the middle of die_if_kernel()
> > or similar.
> 
> Right you are.
> 
> I'm really beginning to hate ptrace_{event,notify}() and those
> PTRACE_EVENT_xyz things.
> 
> I don't even know what uses them, honestly. How very annoying.
> 
> I guess it's easy enough (famous last words) to move the
> ptrace_event() call out of do_exit() and into the actual
> exit/exit_group system calls, and the signal handling path. The paths
> that actually have proper pt_regs.
> 
> Looks like sys_exit() and do_group_exit() would be the two places to
> do it (do_group_exit() would handle the signal case and
> sys_group_exit()).

Maybe...  I'm digging through that pile right now, will follow up when
I get a reasonably complete picture.  In the meanwhile, do kernel/kthread.c
uses look even remotely sane?  Intentional - sure, but it really looks
wrong to use thread exit code as communication channel there...

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 19:22                           ` Linus Torvalds
  2021-06-21 19:45                             ` Al Viro
@ 2021-06-21 20:03                             ` Eric W. Biederman
  2021-06-21 23:15                               ` Linus Torvalds
  1 sibling, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-21 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 21, 2021 at 11:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         There's a large mess around do_exit() - we have a bunch of
>> callers all over arch/*; if nothing else, I very much doubt that really
>> want to let tracer play with a thread in the middle of die_if_kernel()
>> or similar.
>
> Right you are.
>
> I'm really beginning to hate ptrace_{event,notify}() and those
> PTRACE_EVENT_xyz things.
>
> I don't even know what uses them, honestly. How very annoying.

Modern strace does.  Modern gdb appears not to.

However strace at least does not read the exit code,
or really appear to care about stopping for PTRACE_EVENT_EXIT.

I completely agree with you that they are very annoying.

> I guess it's easy enough (famous last words) to move the
> ptrace_event() call out of do_exit() and into the actual
> exit/exit_group system calls, and the signal handling path. The paths
> that actually have proper pt_regs.
>
> Looks like sys_exit() and do_group_exit() would be the two places to
> do it (do_group_exit() would handle the signal case and
> sys_group_exit()).

For other ptrace_event calls I am playing with seeing if I can split
them in two, like sending a signal, so that we can perform all
of the work in get_signal.

I think we can even change exit_group(2) and exit(2) so that (at least
when ptraced) they just send the ``event signal'' and then the signal
handling path handles all of the ptrace stuff.


When I started I thought it was just going to be exit and
PTRACE_EVENT_EXIT and some old architectures, and that a generic
solution was going to be hard.

I still think we are going to need to fix the io_uring threads on the
architectures that use the caller saved register optimization like alpha
and m68k.  But I think we can handle the rest in generic code.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 16:50                           ` Eric W. Biederman
@ 2021-06-21 23:05                             ` Al Viro
  2021-06-22 16:39                               ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 23:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> >> 
> >> > And I think our horrible "kernel threads return to user space when
> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> >> > mostly of the historical horror sort.
> >> 
> >> How would you prefer to handle that, then?  Separate magical path from
> >> kernel_execve() to switch to userland?  We used to have something of
> >> that sort, and that had been a real horror...
> >> 
> >> As it is, it's "kernel thread is spawned at the point similar to
> >> ret_from_fork(), runs the payload (which almost never returns) and
> >> then proceeds out to userland, same way fork(2) would've done."
> >> That way kernel_execve() doesn't have to do anything magical.
> >> 
> >> Al, digging through the old notes and current call graph...
> >
> > 	FWIW, the major assumption back then had been that get_signal(),
> > signal_delivered() and all associated machinery (including coredumps)
> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
> >
> > 	And "has complete registers on stack" is only a part of that;
> > there was other fun stuff in the area ;-/  Do we want coredumps for
> > those, and if we do, will the de_thread stuff work there?
> 
> Do we want coredumps from processes that use io_uring? yes
> Exactly what we want from io_uring threads is less clear.  We can't
> really give much that is meaningful beyond the thread ids of the
> io_uring threads.
> 
> What problems are you seeing beyond the missing registers on the
> stack for kernel threads?
> 
> I don't immediately see the connection between coredumps and de_thread.
> 
> The function de_thread arranges for the fatal_signal_pending to be true,
> and that should work just fine for io_uring threads.  The io_uring
> threads process the fatal_signal with get_signal and then proceed to
> exit eventually calling do_exit.

I would like to see the testing in cases when the io-uring thread is
the one getting hit by initial signal and when it's the normal one
with associated io-uring ones.  The thread-collecting logics at least
used to depend upon fairly subtle assumptions, and "kernel threads
obviously can't show up as candidates" used to narrow the analysis
down...

In any case, WTF would we allow reads or writes to *any* registers of
such threads?  It's not as simple as "just return zeroes", BTW - the
values allowed in special registers might have non-trivial constraints
on them.  The same goes for coredump - we don't _have_ registers to
dump for those, period.

Looks like the first things to do would be
	* prohibit ptrace accessing any regsets of worker threads
	* make coredump skip all register notes for those

Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
current->ptrace in the caller of the latter.  So fork and exec side
has ptrace_event() crap limited to real syscalls.

It's seccomp[1] and exit-related stuff that are messy...

[1] "never trust somebody who introduces himself as Honest Joe and keeps
carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY,
etc.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 19:45                             ` Al Viro
@ 2021-06-21 23:14                               ` Linus Torvalds
  2021-06-21 23:23                                 ` Al Viro
                                                   ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 23:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook,
	Tetsuo Handa

On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Looks like sys_exit() and do_group_exit() would be the two places to
> > do it (do_group_exit() would handle the signal case and
> > sys_group_exit()).
>
> Maybe...  I'm digging through that pile right now, will follow up when
> I get a reasonably complete picture

We might have another possible way to solve this:

 (a) make it the rule that everybody always saves the full (integer)
register set in pt_regs

 (b) make m68k just always create that switch-stack for all system
calls (it's really not that big, I think it's like six words or
something)

 (c) admit that alpha is broken, but nobody really cares

> In the meanwhile, do kernel/kthread.c uses look even remotely sane?
> Intentional - sure, but it really looks wrong to use thread exit code
> as communication channel there...

I really doubt that it is even "intentional".

I think it's "use some errno as a random exit code" and nobody ever
really thought about it, or thought about how that doesn't really
work. People are used to the error numbers, not thinking about how
do_exit() doesn't take an error number, but a signal number (and an
8-bit positive error code in bits 8-15).

Because no, it's not even remotely sane.

I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make
more sense. And the -ENOMEM might be SIGBUS, perhaps.

It does look like the usermode-helper code does save the exit code
with things like

                kernel_wait(pid, &sub_info->retval);

and I see call_usermodehelper_exec() doing

        retval = sub_info->retval;

and treating it as an error code. But I think those have never been
tested with that (bogus) exit code thing from kernel_wait(), because
it wouldn't have worked.  It has only ever been tested with the (real)
exit code things like

                if (pid < 0) {
                        sub_info->retval = pid;

which does actually assign a negative error code to it.

So I think that

                kernel_wait(pid, &sub_info->retval);

line is buggy, and should be something like

                int wstatus;
                kernel_wait(pid, &wstatus);
                sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0;

or something.
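
To make the encoding concrete, here is a minimal userspace sketch of the
wait-status layout described above (terminating signal in the low 7 bits,
exit code in bits 8-15). All sketch_* names are hypothetical helpers, not
kernel functions, and the value -4 stands in for -EINTR on the assumption
that EINTR is 4, as it is on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper names. The bit layout is the Linux wait-status
 * encoding: low 7 bits = terminating signal, bits 8-15 = exit code.
 */

/* What sys_exit() hands to do_exit() for a given user exit code. */
static int sketch_exit_status(int exit_code)
{
	return (exit_code & 0xff) << 8;
}

/* Signal number a decoder would see in a wait status. */
static int sketch_termsig(int wstatus)
{
	return wstatus & 0x7f;
}

/* Exit code a decoder would see in a wait status. */
static int sketch_exitcode(int wstatus)
{
	return (int)(((unsigned int)wstatus >> 8) & 0xff);
}

/* Collapse a wait status to 0 / -EINVAL, along the lines suggested above. */
static int sketch_helper_retval(int wstatus)
{
	return sketch_exitcode(wstatus) ? -EINVAL : 0;
}

/* Print how a raw value looks to anyone decoding it as a wait status. */
static void sketch_show(const char *what, int wstatus)
{
	printf("%-20s termsig=%d exitcode=%d\n", what,
	       sketch_termsig(wstatus), sketch_exitcode(wstatus));
}
```

Calling sketch_show("do_exit(-4)", -4) from a test main() shows the
negative errno decoding as "killed by signal 124 with exit code 255",
which is neither an errno nor a sane exit status.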

            Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 20:03                             ` Eric W. Biederman
@ 2021-06-21 23:15                               ` Linus Torvalds
  2021-06-22 20:52                                 ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 23:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 1:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> For other ptrace_event calls I am playing with seeing if I can split
> them in two.  Like sending a signal.  So that we can perform all
> of the work in get_signal.

That sounds like the right model, but I don't think it works.
Particularly not for exit(). The second phase will never happen.

              Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:14                               ` Linus Torvalds
@ 2021-06-21 23:23                                 ` Al Viro
  2021-06-21 23:36                                   ` Linus Torvalds
  2021-06-22  0:01                                 ` Michael Schmitz
  2021-06-22 20:04                                 ` Michael Schmitz
  2 siblings, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook,
	Tetsuo Handa

On Mon, Jun 21, 2021 at 04:14:36PM -0700, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Looks like sys_exit() and do_group_exit() would be the two places to
> > > do it (do_group_exit() would handle the signal case and
> > > sys_group_exit()).
> >
> > Maybe...  I'm digging through that pile right now, will follow up when
> > I get a reasonably complete picture
> 
> We might have another possible way to solve this:
> 
>  (a) make it the rule that everybody always saves the full (integer)
> register set in pt_regs
> 
>  (b) make m68k just always create that switch-stack for all system
> calls (it's really not that big, I think it's like six words or
> something)
> 
>  (c) admit that alpha is broken, but nobody really cares

	How would it help e.g. oopsen on the way out of timer interrupts?
IMO we simply shouldn't allow ptrace access if the tracee is in that kind
of state, on any architecture...

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 19:24                           ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro
@ 2021-06-21 23:24                             ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21 23:24 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Al,

On 22/06/21 7:24 am, Al Viro wrote:
>
>> 	There's a large mess around do_exit() - we have a bunch of
>> callers all over arch/*; if nothing else, I very much doubt that we really
>> want to let tracer play with a thread in the middle of die_if_kernel()
>> or similar.
>>
>> We sure as hell do not want to arrange for anything on the kernel
>> stack in such situations, no matter what's done in exit(2)...
> FWIW, on alpha it's die_if_kernel(), do_entUna() and do_page_fault(),
> all in not-from-userland cases.  On m68k - die_if_kernel(), do_page_fault()
> (both for non-from-userland cases) and something really odd - fpsp040_die().
> Exception handling for floating point stuff on 68040?  Looks like it has
Exception handling for emulated floating point instructions, really -
exceptions happening when executing FPU instructions on hardware will
do the normal exception processing.
> an open-coded copy_to_user()/copy_from_user(), with faults doing hard
> do_exit(SIGSEGV) instead of raising a signal and trying to do something
> sane...

Yes, that's what it does. Not pretty ... though all that using m68k 
copy_to_user()/copy_from_user() would change is returning how many bytes
could not be copied. In contrast to the ifpsp060 code, we could not pass on
that return status to callers of copyin/copyout in fpsp040, so I don't 
see what sane thing could be done if a fault happens.

(I'd expect the MMU would have raised a bus error and resolved the 
problem by a page fault if possible, before we ever get to this point?)

> I really don't want to try and figure out how painful would it be to
> teach that code how to deal with faults - _testing_ anything in that
> area sure as hell will be.  IIRC, details of recovery from FPU exceptions
> on 68040 in the manual left impression of a minefield...

This is only about faults when moving data from/to user space. FPU 
exceptions are handled elsewhere in the code. So we at least don't have 
to deal with that particular minefield.

Teaching the fpsp040 code to deal with access faults looks horrible 
indeed... let's not go there.

Cheers,

     Michael



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:23                                 ` Al Viro
@ 2021-06-21 23:36                                   ` Linus Torvalds
  2021-06-22 21:02                                     ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 23:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook,
	Tetsuo Handa

On Mon, Jun 21, 2021 at 4:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         How would it help e.g. oopsen on the way out of timer interrupts?
> IMO we simply shouldn't allow ptrace access if the tracee is in that kind
> of state, on any architecture...

Yeah no, we can't do the "wait for ptrace" when the exit is due to an
oops. Although honestly, we have other cases like that where do_exit()
isn't 100% robust if you kill something in an interrupt. Like all the
locks it leaves locked etc.

So do_exit() from a timer interrupt is going to cause problems
regardless. I agree it's probably a good idea to try to avoid causing
even more with the odd ptrace thing, but I don't think ptrace_event is
some really "fundamental" problem at that point - it's just one detail
among many many.

So I was more thinking of the debug patch for m68k to catch all the
_regular_ cases, and all the other random cases of ptrace_event() or
ptrace_notify().

Although maybe we've really caught them all. The exit case was clearly
missing, and the thread fork case was scrogged. There are patches for
the known problems. The patches I really don't like are the
verification ones to find any unknown ones..

            Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:14                               ` Linus Torvalds
  2021-06-21 23:23                                 ` Al Viro
@ 2021-06-22  0:01                                 ` Michael Schmitz
  2021-06-22 20:04                                 ` Michael Schmitz
  2 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-22  0:01 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa

Hi Linus,

On 22/06/21 11:14 am, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> Looks like sys_exit() and do_group_exit() would be the two places to
>>> do it (do_group_exit() would handle the signal case and
>>> sys_group_exit()).
>> Maybe...  I'm digging through that pile right now, will follow up when
>> I get a reasonably complete picture
> We might have another possible way to solve this:
>
>   (a) make it the rule that everybody always saves the full (integer)
> register set in pt_regs
>
>   (b) make m68k just always create that switch-stack for all system
> calls (it's really not that big, I think it's like six words or
> something)

Correct - six words for registers, one for the return address. Probably 
still a win compared to setting and clearing flag bits all over the 
place in an attempt to catch any as yet undetected unsafe cases of 
ptrace_stop.

I'll have to see how much of a performance impact I can see (not that I 
can even remotely measure that accurately - it's more of a 'does it now 
feel real sluggish' thing).

Cheers,

     Michael


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:05                             ` Al Viro
@ 2021-06-22 16:39                               ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-22 16:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
>> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>> >> 
>> >> > And I think our horrible "kernel threads return to user space when
>> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
>> >> > mostly of the historical horror sort.
>> >> 
>> >> How would you prefer to handle that, then?  Separate magical path from
>> >> kernel_execve() to switch to userland?  We used to have something of
>> >> that sort, and that had been a real horror...
>> >> 
>> >> As it is, it's "kernel thread is spawned at the point similar to
>> >> ret_from_fork(), runs the payload (which almost never returns) and
>> >> then proceeds out to userland, same way fork(2) would've done."
>> >> That way kernel_execve() doesn't have to do anything magical.
>> >> 
>> >> Al, digging through the old notes and current call graph...
>> >
>> > 	FWIW, the major assumption back then had been that get_signal(),
>> > signal_delivered() and all associated machinery (including coredumps)
>> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
>> >
>> > 	And "has complete registers on stack" is only a part of that;
>> > there was other fun stuff in the area ;-/  Do we want coredumps for
>> > those, and if we do, will the de_thread stuff work there?
>> 
>> Do we want coredumps from processes that use io_uring? yes
>> Exactly what we want from io_uring threads is less clear.  We can't
>> really give much that is meaningful beyond the thread ids of the
>> io_uring threads.
>> 
>> What problems are you seeing beyond the missing registers on the
>> stack for kernel threads?
>> 
>> I don't immediately see the connection between coredumps and de_thread.
>> 
>> The function de_thread arranges for the fatal_signal_pending to be true,
>> and that should work just fine for io_uring threads.  The io_uring
>> threads process the fatal_signal with get_signal and then proceed to
>> exit eventually calling do_exit.
>
> I would like to see the testing in cases when the io-uring thread is
> the one getting hit by initial signal and when it's the normal one
> with associated io-uring ones.  The thread-collecting logics at least
> used to depend upon fairly subtle assumptions, and "kernel threads
> obviously can't show up as candidates" used to narrow the analysis
> down...
>
> In any case, WTF would we allow reads or writes to *any* registers of
> such threads?  It's not as simple as "just return zeroes", BTW - the
> values allowed in special registers might have non-trivial constraints
> on them.  The same goes for coredump - we don't _have_ registers to
> dump for those, period.
>
> Looks like the first things to do would be
> 	* prohibit ptrace accessing any regsets of worker threads
> 	* make coredump skip all register notes for those

Skipping register notes is fine.  Prohibiting ptrace access to any
regsets of worker threads is interesting.  I think that was tried and
shown to confuse gdb.  So the conclusion was just to provide a fake set
of registers.

Which appears to work up to the point of dealing with architectures
that have their magic caller-saved optimization (like alpha and m68k),
and no check that all of the registers were saved when accessed.  Adding
a dummy switch stack frame for the kernel threads on those architectures
looks like a good/cheap solution at first glance.

> Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
> ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
> current->ptrace in the caller of the latter.  So fork and exec side
> has ptrace_event() crap limited to real syscalls.

That is where I thought we were.  Thanks for confirming that.

> It's seccomp[1] and exit-related stuff that are messy...
>
> [1] "never trust somebody who introduces himself as Honest Joe and keeps
> carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY,
> etc.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:14                               ` Linus Torvalds
  2021-06-21 23:23                                 ` Al Viro
  2021-06-22  0:01                                 ` Michael Schmitz
@ 2021-06-22 20:04                                 ` Michael Schmitz
  2021-06-22 20:18                                   ` Al Viro
  2 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-22 20:04 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa,
	Andreas Schwab

Hi Linus,

On 22/06/21 11:14 am, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> Looks like sys_exit() and do_group_exit() would be the two places to
>>> do it (do_group_exit() would handle the signal case and
>>> sys_group_exit()).
>> Maybe...  I'm digging through that pile right now, will follow up when
>> I get a reasonably complete picture
> We might have another possible way to solve this:
>
>   (a) make it the rule that everybody always saves the full (integer)
> register set in pt_regs
>
>   (b) make m68k just always create that switch-stack for all system
> calls (it's really not that big, I think it's like six words or
> something)

Turns out that is harder than it looked at first glance (at least for me).

All syscalls that _do_ save the switch stack are currently called 
through wrappers which pull the syscall arguments out of the saved 
pt_regs on the stack (pushing the switch stack after the SAVE_ALL saved 
stuff buries the syscall arguments on the stack, see comment about
m68k_clone()). We'd have to push the switch stack _first_ when entering
system_call to leave the syscall arguments in place, but that will 
require further changes to the syscall exit path (currently shared with 
the interrupt exit path). Not to mention the register offset 
calculations in arch/m68k/kernel/ptrace.c, and perhaps a few other 
dependencies that don't come to mind immediately.

We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the 
ordering of the two is only mentioned in a comment. Can we reorder them 
on the stack, as long as we don't change the struct definitions proper?
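
To illustrate why the on-stack ordering matters to the offset
calculations, here is a minimal sketch using simplified stand-in structs
(NOT the real m68k definitions; every sketch_* name is hypothetical). It
assumes the current ordering, where switch_stack is pushed after pt_regs
and therefore sits directly below it, so switch_stack members are reached
with negative offsets from the pt_regs base.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for illustration; NOT the real m68k layouts. */
struct sketch_switch_stack {	/* six registers plus the return address */
	unsigned long d6, d7, a3, a4, a5, a6, retpc;
};

struct sketch_pt_regs {		/* trimmed: the real struct has more fields */
	long d1, d2, d3, d4, d5, a0, a1, a2, d0, orig_d0;
};

/*
 * Frame as it sits in memory when switch_stack is pushed after pt_regs:
 * the later push ends up at the lower address.
 */
struct sketch_frame {
	struct sketch_switch_stack sw;
	struct sketch_pt_regs regs;
};

/* Negative byte offset of a switch_stack member from the pt_regs base. */
#define SKETCH_SW_OFF(member) \
	((long)offsetof(struct sketch_switch_stack, member) - \
	 (long)sizeof(struct sketch_switch_stack))

/* Read a switch_stack register via its offset from the pt_regs base. */
static unsigned long sketch_read_sw(const struct sketch_frame *f, long off)
{
	const char *regs_base = (const char *)f +
				offsetof(struct sketch_frame, regs);
	unsigned long val;

	memcpy(&val, regs_base + off, sizeof(val));
	return val;
}
```

Swapping the two members of sketch_frame (switch_stack pushed first)
would turn every SKETCH_SW_OFF() value into a positive offset past the
end of pt_regs, which is the kind of change that any code relying on the
current ordering would have to follow.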

This will take a little more time to work out and test - certainly not 
before the weekend. I'll send a corrected version of my debug patch 
before that.

Cheers,

     Michael



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-22 20:04                                 ` Michael Schmitz
@ 2021-06-22 20:18                                   ` Al Viro
  2021-06-22 21:57                                     ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-22 20:18 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa,
	Andreas Schwab

On Wed, Jun 23, 2021 at 08:04:11AM +1200, Michael Schmitz wrote:

> All syscalls that _do_ save the switch stack are currently called through
> wrappers which pull the syscall arguments out of the saved pt_regs on the
> stack (pushing the switch stack after the SAVE_ALL saved stuff buries the
> syscall arguments on the stack, see comment about m68k_clone()). We'd have to
> push the switch stack _first_ when entering system_call to leave the syscall
> arguments in place, but that will require further changes to the syscall
> exit path (currently shared with the interrupt exit path). Not to mention
> the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a
> few other dependencies that don't come to mind immediately.
> 
> We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering
> of the two is only mentioned in a comment. Can we reorder them on the stack,
> as long as we don't change the struct definitions proper?
> 
> This will take a little more time to work out and test - certainly not
> before the weekend. I'll send a corrected version of my debug patch before
> that.

This is insane, *especially* on m68k where you have the mess with different
frame layouts and associated ->stkadj crap (see mangle_kernel_stack() for
the (very) full barfbag).

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:15                               ` Linus Torvalds
@ 2021-06-22 20:52                                 ` Eric W. Biederman
  2021-06-23  0:41                                   ` Linus Torvalds
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-22 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 21, 2021 at 1:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> For other ptrace_event calls I am playing with seeing if I can split
>> them in two.  Like sending a signal.  So that we can perform all
>> of the work in get_signal.
>
> That sounds like the right model, but I don't think it works.
> Particularly not for exit(). The second phase will never happen.

Playing with it some more I think I have everything working
except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and
exit_group(2).

Basically in exit sending yourself a signal and then calling do_exit
from the signal handler is not unreasonable, as exit is an ordinary
system call.

I haven't seen anything that ``knows'' that exit(2) or exit_group(2)
will never return and adds a special case in the system call table for
that case.

The complications of exit_group(2) are roughly those of moving
ptrace_event out of do_exit.   They look doable and I am going to look
at that next.

This is not to say that this is the most maintainable way or that we
necessarily want to implement things this way, but I need to look and
see what it looks like.
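
The draft below records a posted event as (event << 8) | SIGTRAP. As a
minimal userspace sketch of that packing (the sketch_* names are
hypothetical, and the event value 6 is copied from the uapi definition
of PTRACE_EVENT_EXIT purely for illustration):

```c
#include <assert.h>
#include <signal.h>

/* Value copied from the uapi ptrace header, for illustration only. */
#define SKETCH_PTRACE_EVENT_EXIT 6

/* Pack an event number above SIGTRAP, as the draft's stop_code does. */
static int sketch_make_stop_code(int event)
{
	return (event << 8) | SIGTRAP;
}

/* Recover the event number from a packed stop code. */
static int sketch_stop_event(int code)
{
	return code >> 8;
}

/* Recover the signal number from a packed stop code. */
static int sketch_stop_signal(int code)
{
	return code & 0xff;
}
```

The point of the packing is that whoever later handles the stop in
get_signal can tell an event stop from a plain SIGTRAP by checking
whether the upper part is non-zero.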

For purposes of discussion this is my current draft implementation.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c881384517..891812d32b90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
 	struct capture_control		*capture_control;
 #endif
 	/* Ptrace state: */
+	int				stop_code;
 	unsigned long			ptrace_message;
 	kernel_siginfo_t		*last_siginfo;
 
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index b5ebf6c01292..33c50119b193 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -164,18 +164,29 @@ static inline void ptrace_event(int event, unsigned long message)
 	}
 }
 
+static inline bool ptrace_post_event(int event, unsigned long message)
+{
+	bool posted = false;
+	if (unlikely(ptrace_event_enabled(current, event))) {
+		current->ptrace_message = message;
+		current->stop_code = (event << 8) | SIGTRAP;
+		set_tsk_thread_flag(current, TIF_SIGPENDING);
+		posted = true;
+	} else if (event == PTRACE_EVENT_EXEC) {
+		/* legacy EXEC report via SIGTRAP */
+		if ((current->ptrace & (PT_PTRACED|PT_SEIZED)) == PT_PTRACED)
+			send_sig(SIGTRAP, current, 0);
+	}
+	return posted;
+}
+
 /**
- * ptrace_event_pid - possibly stop for a ptrace event notification
- * @event:	%PTRACE_EVENT_* value to report
- * @pid:	process identifier for %PTRACE_GETEVENTMSG to return
- *
- * Check whether @event is enabled and, if so, report @event and @pid
- * to the ptrace parent.  @pid is reported as the pid_t seen from the
- * ptrace parent's pid namespace.
+ * pid_parent_nr - Return the number the parent knows this pid as
+ * @pid:	The struct pid whose numerical value we want
  *
  * Called without locks.
  */
-static inline void ptrace_event_pid(int event, struct pid *pid)
+static inline pid_t pid_parent_nr(struct pid *pid)
 {
 	/*
 	 * FIXME: There's a potential race if a ptracer in a different pid
@@ -183,16 +194,15 @@ static inline void ptrace_event_pid(int event, struct pid *pid)
 	 * when we acquire tasklist_lock in ptrace_stop().  If this happens,
 	 * the ptracer will get a bogus pid from PTRACE_GETEVENTMSG.
 	 */
-	unsigned long message = 0;
+	pid_t nr = 0;
 	struct pid_namespace *ns;
 
 	rcu_read_lock();
 	ns = task_active_pid_ns(rcu_dereference(current->parent));
 	if (ns)
-		message = pid_nr_ns(pid, ns);
+		nr = pid_nr_ns(pid, ns);
 	rcu_read_unlock();
-
-	ptrace_event(event, message);
+	return nr;
 }
 
 /**
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..a2eac3831369 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -97,6 +97,8 @@ extern void exit_mm_release(struct task_struct *, struct mm_struct *);
 /* Remove the current tasks stale references to the old mm_struct on exec() */
 extern void exec_mm_release(struct task_struct *, struct mm_struct *);
 
+extern int wait_for_vfork_done(struct task_struct *child, struct completion *vfork);
+
 #ifdef CONFIG_MEMCG
 extern void mm_update_next_owner(struct mm_struct *mm);
 #else
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..bb4751d84e2d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1781,7 +1781,7 @@ static int exec_binprm(struct linux_binprm *bprm)
 
 	audit_bprm(bprm);
 	trace_sched_process_exec(current, old_pid, bprm);
-	ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
+	ptrace_post_event(PTRACE_EVENT_EXEC, old_vpid);
 	proc_exec_connector(current);
 	return 0;
 }
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..aeb22a8e4d24 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);
 
 SYSCALL_DEFINE1(exit, int, error_code)
 {
-	do_exit((error_code&0xff)<<8);
+	long code = (error_code&0xff)<<8;
+	if (!ptrace_post_event(PTRACE_EVENT_EXIT, code))
+		do_exit(code);
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..8533e056a3d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1266,8 +1266,7 @@ static void complete_vfork_done(struct task_struct *tsk)
 	task_unlock(tsk);
 }
 
-static int wait_for_vfork_done(struct task_struct *child,
-				struct completion *vfork)
+int wait_for_vfork_done(struct task_struct *child, struct completion *vfork)
 {
 	int killed;
 
@@ -2278,7 +2277,8 @@ static __latent_entropy struct task_struct *copy_process(
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
-		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
+		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) ||
+				 (trace && ptrace_event_enabled(current, trace)));
 
 		init_task_pid(p, PIDTYPE_PID, pid);
 		if (thread_group_leader(p)) {
@@ -2462,7 +2462,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 pid_t kernel_clone(struct kernel_clone_args *args)
 {
 	u64 clone_flags = args->flags;
-	struct completion vfork;
+	unsigned long message;
 	struct pid *pid;
 	struct task_struct *p;
 	int trace = 0;
@@ -2495,9 +2495,6 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 			trace = PTRACE_EVENT_CLONE;
 		else
 			trace = PTRACE_EVENT_FORK;
-
-		if (likely(!ptrace_event_enabled(current, trace)))
-			trace = 0;
 	}
 
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
@@ -2512,30 +2509,27 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 	 */
 	trace_sched_process_fork(current, p);
 
-	pid = get_task_pid(p, PIDTYPE_PID);
+	pid = task_pid(p);
 	nr = pid_vnr(pid);
+	message = pid_parent_nr(pid);
 
 	if (clone_flags & CLONE_PARENT_SETTID)
 		put_user(nr, args->parent_tid);
 
-	if (clone_flags & CLONE_VFORK) {
-		p->vfork_done = &vfork;
+	if (!(clone_flags & CLONE_VFORK)) {
+		wake_up_new_task(p);
+		ptrace_post_event(trace, message);
+	}
+	else if (!ptrace_post_event(PTRACE_EVENT_VFORK, (unsigned long)p)) {
+		struct completion vfork;
 		init_completion(&vfork);
+		p->vfork_done = &vfork;
 		get_task_struct(p);
+		wake_up_new_task(p);
+		if (!wait_for_vfork_done(p, &vfork))
+			ptrace_post_event(PTRACE_EVENT_VFORK_DONE, message);
 	}
 
-	wake_up_new_task(p);
-
-	/* forking complete and child started to run, tell ptracer */
-	if (unlikely(trace))
-		ptrace_event_pid(trace, pid);
-
-	if (clone_flags & CLONE_VFORK) {
-		if (!wait_for_vfork_done(p, &vfork))
-			ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
-	}
-
-	put_pid(pid);
 	return nr;
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..8ac8c4a31d88 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -155,7 +155,8 @@ static bool recalc_sigpending_tsk(struct task_struct *t)
 	if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) ||
 	    PENDING(&t->pending, &t->blocked) ||
 	    PENDING(&t->signal->shared_pending, &t->blocked) ||
-	    cgroup_task_frozen(t)) {
+	    cgroup_task_frozen(t) ||
+	    t->stop_code) {
 		set_tsk_thread_flag(t, TIF_SIGPENDING);
 		return true;
 	}
@@ -2607,6 +2608,39 @@ bool get_signal(struct ksignal *ksig)
 	if (unlikely(current->task_works))
 		task_work_run();
 
+ptrace_event:
+	/* Handle a posted ptrace event */
+	if (unlikely(current->stop_code)) {
+		int stop_code = current->stop_code;
+		unsigned long message = current->ptrace_message;
+		struct completion vfork;
+		struct task_struct *p;
+
+		current->stop_code = 0;
+
+		if ((stop_code >> 8) == PTRACE_EVENT_VFORK) {
+			p = (struct task_struct *)message;
+			get_task_struct(p);
+			current->ptrace_message = pid_parent_nr(task_pid(p));
+			init_completion(&vfork);
+			p->vfork_done = &vfork;
+			wake_up_new_task(p);
+		}
+
+		spin_lock_irq(&sighand->siglock);
+		ptrace_do_notify(SIGTRAP, stop_code, CLD_TRAPPED);
+		spin_unlock_irq(&sighand->siglock);
+
+		if (((stop_code >> 8) == PTRACE_EVENT_VFORK) &&
+		    !wait_for_vfork_done(p, &vfork) &&
+		    ptrace_post_event(PTRACE_EVENT_VFORK_DONE, message))
+			goto ptrace_event;
+
+		if ((stop_code >> 8) == PTRACE_EVENT_EXIT) {
+			do_exit(message);
+		}
+	}
+
 	/*
 	 * For non-generic architectures, check for TIF_NOTIFY_SIGNAL so
 	 * that the arch handlers don't all have to do it. If we get here

Eric
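
For readers following the draft above: ptrace_post_event() stores the
event as (event << 8) | SIGTRAP, so anything that later inspects
stop_code has to shift before comparing against a PTRACE_EVENT_* value.
A minimal user-space sketch of that encoding follows.  The helper names
and the EV_* constants are made up for illustration (the values are
assumed to mirror PTRACE_EVENT_VFORK and PTRACE_EVENT_EXIT from
<linux/ptrace.h>); this is not kernel code.

```c
#include <assert.h>
#include <signal.h>

/* Assumed to mirror PTRACE_EVENT_VFORK / PTRACE_EVENT_EXIT; repeated
 * here so the sketch builds without kernel headers. */
#define EV_VFORK 2
#define EV_EXIT  6

/* Pack an event the way the draft's ptrace_post_event() does. */
static int pack_stop_code(int event)
{
	return (event << 8) | SIGTRAP;
}

/* Recover the event number from a packed stop code. */
static int stop_code_event(int stop_code)
{
	return stop_code >> 8;
}

/* Recover the signal number from a packed stop code. */
static int stop_code_signal(int stop_code)
{
	return stop_code & 0xff;
}

/* Small self-check: a packed code never equals the bare event number. */
static void stop_code_self_check(void)
{
	int code = pack_stop_code(EV_EXIT);

	assert(code != EV_EXIT);
	assert(stop_code_event(code) == EV_EXIT);
	assert(stop_code_signal(code) == SIGTRAP);
}
```

The low byte is what a tracer sees as the stop signal, and the bits
above it carry the event, which is why the packed value can be handed
to the notification path unchanged.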

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-21 23:36                                   ` Linus Torvalds
@ 2021-06-22 21:02                                     ` Eric W. Biederman
  2021-06-22 21:48                                       ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-22 21:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 21, 2021 at 4:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         How would it help e.g. oopsen on the way out of timer interrupts?
>> IMO we simply shouldn't allow ptrace access if the tracee is in that kind
>> of state, on any architecture...
>
> Yeah no, we can't do the "wait for ptrace" when the exit is due to an
> oops. Although honestly, we have other cases like that where do_exit()
> isn't 100% robust if you kill something in an interrupt. Like all the
> locks it leaves locked etc.
>
> So do_exit() from a timer interrupt is going to cause problems
> regardless. I agree it's probably a good idea to try to avoid causing
> even more with the odd ptrace thing, but I don't think ptrace_event is
> some really "fundamental" problem at that point - it's just one detail
> among many many.
>
> So I was more thinking of the debug patch for m68k to catch all the
> _regular_ cases, and all the other random cases of ptrace_event() or
> ptrace_notify().
>
> Although maybe we've really caught them all. The exit case was clearly
> missing, and the thread fork case was scrogged. There are patches for
> the known problems. The patches I really don't like are the
> verification ones to find any unknown ones..

We still have nios2 which copied the m68k logic at some point.  I think
that is a processor that is still ``shipping'' and that people might
still be using in new designs.

I haven't looked closely enough to see what the other architectures with
caller saved registers are doing.

The challenging ones are /proc/pid/syscall and seccomp which want to see
all of the system call arguments.  I think every architecture always
saves the system call arguments unconditionally, so those cases are
probably not as interesting.  But they certainly look like they could be
trouble.

Eric


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-22 21:02                                     ` Eric W. Biederman
@ 2021-06-22 21:48                                       ` Michael Schmitz
  2021-06-23  5:26                                         ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-22 21:48 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Al Viro, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa,
	John Paul Adrian Glaubitz

Hi Eric,

On 23/06/21 9:02 am, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> So I was more thinking of the debug patch for m68k to catch all the
>> _regular_ cases, and all the other random cases of ptrace_event() or
>> ptrace_notify().
>>
>> Although maybe we've really caught them all. The exit case was clearly
>> missing, and the thread fork case was scrogged. There are patches for
>> the known problems. The patches I really don't like are the
>> verification ones to find any unknown ones..
>
> We still have nios2 which copied the m68k logic at some point.  I think
> that is a processor that is still ``shipping'' and that people might
> still be using in new designs.
>
> I haven't looked closely enough to see what the other architectures with
> caller saved registers are doing.
>
> The challenging ones are /proc/pid/syscall and seccomp which want to see
> all of the system call arguments.  I think every architecture always
> saves the system call arguments unconditionally, so those cases are
> probably not as interesting.  But they certainly look like they could be
> trouble.

Seccomp hasn't yet been implemented on m68k, though I'm working on that 
with Adrian. The sole secure_computing() call will happen in 
syscall_trace_enter(), so all system call arguments have been saved on 
the stack.

Haven't looked at /proc/pid/syscall yet ...

Cheers,

     Michael

> Eric
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-22 20:18                                   ` Al Viro
@ 2021-06-22 21:57                                     ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-22 21:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa,
	Andreas Schwab

Hi Al,

On 23/06/21 8:18 am, Al Viro wrote:
> On Wed, Jun 23, 2021 at 08:04:11AM +1200, Michael Schmitz wrote:
>
>> All syscalls that _do_ save the switch stack are currently called through
>> wrappers which pull the syscall arguments out of the saved pt_regs on the
>> stack (pushing the switch stack after the SAVE_ALL saved stuff buries the
>> syscall arguments on the stack, see comment about m68k_clone()). We'd have to
>> push the switch stack _first_ when entering system_call to leave the syscall
>> arguments in place, but that will require further changes to the syscall
>> exit path (currently shared with the interrupt exit path). Not to mention
>> the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a
>> few other dependencies that don't come to mind immediately.
>>
>> We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering
>> of the two is only mentioned in a comment. Can we reorder them on the stack,
>> as long as we don't change the struct definitions proper?
>>
>> This will take a little more time to work out and test - certainly not
>> before the weekend. I'll send a corrected version of my debug patch before
>> that.
> This is insane, *especially* on m68k where you have the mess with different
> frame layouts and associated ->stkadj crap (see mangle_kernel_stack() for
> the (very) full barfbag).

Indeed - that's one of the uses of pt_regs and switch_stack that I 
hadn't yet seen.

So it's either leave the stack layout in system calls unchanged (aside 
from the ones that need the extra registers) and protect against 
accidental misuse of registers that weren't saved, with the overhead of 
playing with thread_info->status bits, or tackle the mess of redoing the 
stack layout to save all registers, always (did I already mention that 
I'd need a _lot_ of help from someone more conversant with m68k assembly 
coding for that option?).

Which one of these two barf bags is the fuller one?

Cheers,

     Michael


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-22 20:52                                 ` Eric W. Biederman
@ 2021-06-23  0:41                                   ` Linus Torvalds
  2021-06-23 14:33                                     ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-23  0:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Tue, Jun 22, 2021 at 1:53 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Playing with it some more I think I have everything working
> except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and
> exit_group(2).
>
> Basically in exit sending yourself a signal and then calling do_exit
> from the signal handler is not unreasonable, as exit is an ordinary
> system call.

Ok, this is a bit odd, but I do like the concept of just making
ptrace_event just post a signal, and have all ptrace things always be
handled at signal time (or the special system call entry/exit, which
is fine too).

> For purposes of discussion this is my current draft implementation.

I didn't check what is so different about exit_group() that you left
that as an exercise for the reader, but if that ends up then removing
the whole "wait synchronously for ptrace" cases for good I don't
_hate_ this. It's a bit odd, but it would be really nice to limit
where ptrace picks up data.

We do end up doing that stuff in "get_signal()", and that means that
we have the interaction with io_uring calling it directly, but it's at
least not a new thing.

               Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-22 21:48                                       ` Michael Schmitz
@ 2021-06-23  5:26                                         ` Michael Schmitz
  2021-06-23 14:36                                           ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-23  5:26 UTC (permalink / raw)
  To: Eric W. Biederman, Linus Torvalds
  Cc: Al Viro, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa,
	John Paul Adrian Glaubitz

Hi Eric,

Am 23.06.2021 um 09:48 schrieb Michael Schmitz:
>>
>> The challenging ones are /proc/pid/syscall and seccomp which want to see
>> all of the system call arguments.  I think every architecture always
>> saves the system call arguments unconditionally, so those cases are
>> probably not as interesting.  But they certainly look like they could be
>> trouble.
>
> Seccomp hasn't yet been implemented on m68k, though I'm working on that
> with Adrian. The sole secure_computing() call will happen in
> syscall_trace_enter(), so all system call arguments have been saved on
> the stack.
>
> Haven't looked at /proc/pid/syscall yet ...

Not supported at present (no HAVE_ARCH_TRACEHOOK for m68k). And the 
syscall_get_arguments I wrote for seccomp support only copies the first 
five data registers, which are always saved.

Cheers,

	Michael
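
To make the point about always-saved argument registers concrete, here
is a rough user-space sketch of a syscall_get_arguments-style helper
that only touches registers saved on every syscall entry.  Everything
in it is hypothetical: struct mock_pt_regs is a trimmed stand-in rather
than the real m68k layout, the "first five data registers" are assumed
to be d1..d5, and the function name is invented for the example.

```c
#include <assert.h>

#define MOCK_NR_ARGS 5

/* Hypothetical, trimmed stand-in for pt_regs: only the registers that
 * this sketch assumes are saved on every syscall entry. */
struct mock_pt_regs {
	long d1, d2, d3, d4, d5;
	long a0;	/* present but deliberately not copied below */
};

/* Copy the five data-register arguments into args[0..4].  Nothing
 * outside the always-saved part of the frame is read. */
static void mock_syscall_get_arguments(const struct mock_pt_regs *regs,
				       unsigned long *args)
{
	assert(regs && args);
	args[0] = (unsigned long)regs->d1;
	args[1] = (unsigned long)regs->d2;
	args[2] = (unsigned long)regs->d3;
	args[3] = (unsigned long)regs->d4;
	args[4] = (unsigned long)regs->d5;
}
```

Because the helper never reaches past the registers pushed on every
entry, it cannot pick up stale kernel stack contents the way a full
register read from a shallow frame could.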

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-23  0:41                                   ` Linus Torvalds
@ 2021-06-23 14:33                                     ` Eric W. Biederman
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-23 14:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 22, 2021 at 1:53 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Playing with it some more I think I have everything working
>> except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and
>> exit_group(2).
>>
>> Basically in exit sending yourself a signal and then calling do_exit
>> from the signal handler is not unreasonable, as exit is an ordinary
>> system call.
>
> Ok, this is a bit odd, but I do like the concept of just making
> ptrace_event just post a signal, and have all ptrace things always be
> handled at signal time (or the special system call entry/exit, which
> is fine too).
>
>> For purposes of discussion this is my current draft implementation.
>
> I didn't check what is so different about exit_group() that you left
> that as an exercise for the reader, but if that ends up then removing
> the whole "wait synchronously for ptrace" cases for good I don't
> _hate_ this. It's a bit odd, but it would be really nice to limit
> where ptrace picks up data.

I am still figuring out exit_group.  I am hoping for sometime today.
My intuition tells me I can do it, and I have a sense of what threads I
need to pull to get there.  I just don't know what the code is going to
look like yet.

Basically solving exit_group means moving ptrace_event out of do_exit.

> We do end up doing that stuff in "get_signal()", and that means that
> we have the interaction with io_uring calling it directly, but it's at
> least not a new thing.

The ugliest bit is having to repeat the wait_for_vfork_done both in fork
and in get_signal.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
  2021-06-23  5:26                                         ` Michael Schmitz
@ 2021-06-23 14:36                                           ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-23 14:36 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, Al Viro, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa,
	John Paul Adrian Glaubitz

Michael Schmitz <schmitzmic@gmail.com> writes:

> Hi Eric,
>
> Am 23.06.2021 um 09:48 schrieb Michael Schmitz:
>>>
>>> The challenging ones are /proc/pid/syscall and seccomp which want to see
>>> all of the system call arguments.  I think every architecture always
>>> saves the system call arguments unconditionally, so those cases are
>>> probably not as interesting.  But they certainly look like they could be
>>> trouble.
>>
>> Seccomp hasn't yet been implemented on m68k, though I'm working on that
>> with Adrian. The sole secure_computing() call will happen in
>> syscall_trace_enter(), so all system call arguments have been saved on
>> the stack.
>>
>> Haven't looked at /proc/pid/syscall yet ...
>
> Not supported at present (no HAVE_ARCH_TRACEHOOK for m68k). And the
> syscall_get_arguments I wrote for seccomp support only copies the first five
> data registers, which are always saved.

Yes.  It is looking like I can fix everything generically except for
faking user space registers for io_uring threads.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH 0/9] Refactoring exit
  2021-06-23 14:33                                     ` Eric W. Biederman
@ 2021-06-24 18:57                                       ` Eric W. Biederman
  2021-06-24 18:59                                         ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman
                                                           ` (9 more replies)
  0 siblings, 10 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


I dug into exit because PTRACE_EVENT_EXIT is not guaranteed to be
reported from a context where ptrace can read and write all of the
userspace registers, which can lead to unfiltered reads and writes of
kernel stack contents.

While looking into it I realized that there are a lot of little races
between all of the ways an exit can be initiated.  I don't know of a way
those races are harmful, but they make the code difficult to reason about.

The solution this set of changes adopts is to implement good primitives
for asynchronous exit and exit_group requests and to modify exit(2) and
exit_group(2) to use those primitives.

The result should be more consistent determination of the reason for an
exit, as well as PTRACE_EVENT_EXIT always being called from a context
(get_signal) where ptrace is guaranteed to be able to read and write
all of the registers.

I believe the set of changes could be justified for the cleanups alone
even if PTRACE_EVENT_EXIT did not need to be moved.  Which makes me
feel good about this approach.

If a way can be found to start coredumps from complete_signal (needed
for timely handling of fatal signals), instead of needing to start them
in do_coredump for proper synchronization, then force_sig_info_to_task
and get_signal can be significantly simplified.  As it is, a lot of
checks are duplicated to ensure that everything works properly in the
presence of do_coredump.

So far the code has been lightly tested, and the descriptions of some
of the patches are a bit light, but I think this shows the direction
I am aiming to travel for sorting out exit(2) and exit_group(2).

Eric W. Biederman (9):
      signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
      signal/seccomp: Refactor seccomp signal and coredump generation
      signal/seccomp: Dump core when there is only one live thread
      signal: Factor start_group_exit out of complete_signal
      signal/group_exit: Use start_group_exit in place of do_group_exit
      signal: Fold do_group_exit into get_signal fixing io_uring threads
      signal: Make individual tasks exiting a first class concept.
      signal/task_exit: Use start_task_exit in place of do_exit
      signal: Move PTRACE_EVENT_EXIT into get_signal

 arch/sh/kernel/cpu/fpu.c     |  10 +--
 fs/exec.c                    |  10 ++-
 include/linux/sched/jobctl.h |   2 +
 include/linux/sched/signal.h |   5 ++
 include/linux/sched/task.h   |   1 -
 kernel/exit.c                |  41 ++---------
 kernel/seccomp.c             |  45 +++---------
 kernel/signal.c              | 166 ++++++++++++++++++++++++++++++-------------
 8 files changed, 154 insertions(+), 126 deletions(-)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
@ 2021-06-24 18:59                                         ` Eric W. Biederman
  2021-06-24 18:59                                         ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman
                                                           ` (8 subsequent siblings)
  9 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Today the sh code allocates memory the first time a process uses
the fpu.  If that memory allocation fails, kill the affected task
with force_sig(SIGKILL) rather than do_group_exit(SIGKILL).

Calling do_group_exit from an exception handler can potentially lead
to deadlocks, as do_group_exit is not designed to be called
from interrupt context.  Instead use force_sig(SIGKILL) to kill
the userspace process.  Sending signals in general and force_sig
in particular has been tested from interrupt context so there
should be no problems.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/sh/kernel/cpu/fpu.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/sh/kernel/cpu/fpu.c b/arch/sh/kernel/cpu/fpu.c
index ae354a2931e7..fd6db0ab1928 100644
--- a/arch/sh/kernel/cpu/fpu.c
+++ b/arch/sh/kernel/cpu/fpu.c
@@ -62,18 +62,20 @@ void fpu_state_restore(struct pt_regs *regs)
 	}
 
 	if (!tsk_used_math(tsk)) {
-		local_irq_enable();
+		int ret;
 		/*
 		 * does a slab alloc which can sleep
 		 */
-		if (init_fpu(tsk)) {
+		local_irq_enable();
+		ret = init_fpu(tsk);
+		local_irq_disable();
+		if (ret) {
 			/*
 			 * ran out of memory!
 			 */
-			do_group_exit(SIGKILL);
+			force_sig(SIGKILL);
 			return;
 		}
-		local_irq_disable();
 	}
 
 	grab_fpu(regs);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  2021-06-24 18:59                                         ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman
@ 2021-06-24 18:59                                         ` Eric W. Biederman
  2021-06-26  3:17                                           ` Kees Cook
  2021-06-24 19:00                                         ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman
                                                           ` (7 subsequent siblings)
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c.  The function force_sig_seccomp takes a
parameter force_coredump to indicate that the sigaction field should be
reset to SIG_DFL so that a coredump will be generated when the signal
is delivered.

force_sig_seccomp is then used to replace both seccomp_send_sigsys
and seccomp_init_siginfo.

force_sig_info_to_task gains an extra parameter to force using
the default signal action.

With this change seccomp is no longer a special case and there
is exactly one place from which do_coredump is called.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  1 +
 kernel/seccomp.c             | 43 ++++++++----------------------------
 kernel/signal.c              | 30 +++++++++++++++++++++----
 3 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7f4278fa21fe..774be5d3ac3e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -329,6 +329,7 @@ int force_sig_pkuerr(void __user *addr, u32 pkey);
 int force_sig_perf(void __user *addr, u32 type, u64 sig_data);
 
 int force_sig_ptrace_errno_trap(int errno, void __user *addr);
+int force_sig_seccomp(int syscall, int reason, bool force_coredump);
 
 extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
 extern void force_sigsegv(int sig);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6ecd3f3a52b5..3e06d4628d98 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -920,30 +920,6 @@ void get_seccomp_filter(struct task_struct *tsk)
 	refcount_inc(&orig->users);
 }
 
-static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason)
-{
-	clear_siginfo(info);
-	info->si_signo = SIGSYS;
-	info->si_code = SYS_SECCOMP;
-	info->si_call_addr = (void __user *)KSTK_EIP(current);
-	info->si_errno = reason;
-	info->si_arch = syscall_get_arch(current);
-	info->si_syscall = syscall;
-}
-
-/**
- * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
- * @syscall: syscall number to send to userland
- * @reason: filter-supplied reason code to send to userland (via si_errno)
- *
- * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info.
- */
-static void seccomp_send_sigsys(int syscall, int reason)
-{
-	struct kernel_siginfo info;
-	seccomp_init_siginfo(&info, syscall, reason);
-	force_sig_info(&info);
-}
 #endif	/* CONFIG_SECCOMP_FILTER */
 
 /* For use with seccomp_actions_logged */
@@ -1195,7 +1171,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		/* Show the handler the original registers. */
 		syscall_rollback(current, current_pt_regs());
 		/* Let the filter pass back 16 bits of data. */
-		seccomp_send_sigsys(this_syscall, data);
+		force_sig_seccomp(this_syscall, data, false);
 		goto skip;
 
 	case SECCOMP_RET_TRACE:
@@ -1266,18 +1242,17 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		/* Dump core only if this is the last remaining thread. */
 		if (action != SECCOMP_RET_KILL_THREAD ||
 		    get_nr_threads(current) == 1) {
-			kernel_siginfo_t info;
-
 			/* Show the original registers in the dump. */
 			syscall_rollback(current, current_pt_regs());
-			/* Trigger a manual coredump since do_exit skips it. */
-			seccomp_init_siginfo(&info, this_syscall, data);
-			do_coredump(&info);
+			/* Trigger a coredump with SIGSYS */
+			force_sig_seccomp(this_syscall, data, true);
+		} else {
+			if (action == SECCOMP_RET_KILL_THREAD)
+				do_exit(SIGSYS);
+			else
+				do_group_exit(SIGSYS);
 		}
-		if (action == SECCOMP_RET_KILL_THREAD)
-			do_exit(SIGSYS);
-		else
-			do_group_exit(SIGSYS);
+		return -1;
 	}
 
 	unreachable();
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..da37cc4515f2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -54,6 +54,7 @@
 #include <asm/unistd.h>
 #include <asm/siginfo.h>
 #include <asm/cacheflush.h>
+#include <asm/syscall.h>	/* for syscall_get_* */
 
 /*
  * SLAB caches for signal bits.
@@ -1349,7 +1350,7 @@ int do_send_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *p
  * that is why we also clear SIGNAL_UNKILLABLE.
  */
 static int
-force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)
+force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t, bool sigdfl)
 {
 	unsigned long int flags;
 	int ret, blocked, ignored;
@@ -1360,7 +1361,7 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)
 	action = &t->sighand->action[sig-1];
 	ignored = action->sa.sa_handler == SIG_IGN;
 	blocked = sigismember(&t->blocked, sig);
-	if (blocked || ignored) {
+	if (blocked || ignored || sigdfl) {
 		action->sa.sa_handler = SIG_DFL;
 		if (blocked) {
 			sigdelset(&t->blocked, sig);
@@ -1381,7 +1382,7 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)
 
 int force_sig_info(struct kernel_siginfo *info)
 {
-	return force_sig_info_to_task(info, current);
+	return force_sig_info_to_task(info, current, false);
 }
 
 /*
@@ -1712,7 +1713,7 @@ int force_sig_fault_to_task(int sig, int code, void __user *addr
 	info.si_flags = flags;
 	info.si_isr = isr;
 #endif
-	return force_sig_info_to_task(&info, t);
+	return force_sig_info_to_task(&info, t, false);
 }
 
 int force_sig_fault(int sig, int code, void __user *addr
@@ -1820,6 +1821,27 @@ int force_sig_perf(void __user *addr, u32 type, u64 sig_data)
 	return force_sig_info(&info);
 }
 
+/**
+ * force_sig_seccomp - signals the task to allow in-process syscall emulation
+ * @syscall: syscall number to send to userland
+ * @reason: filter-supplied reason code to send to userland (via si_errno)
+ *
+ * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info.
+ */
+int force_sig_seccomp(int syscall, int reason, bool force_coredump)
+{
+	struct kernel_siginfo info;
+
+	clear_siginfo(&info);
+	info.si_signo = SIGSYS;
+	info.si_code = SYS_SECCOMP;
+	info.si_call_addr = (void __user *)KSTK_EIP(current);
+	info.si_errno = reason;
+	info.si_arch = syscall_get_arch(current);
+	info.si_syscall = syscall;
+	return force_sig_info_to_task(&info, current, force_coredump);
+}
+
 /* For the crazy architectures that include trap information in
  * the errno field, instead of an actual errno value.
  */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  2021-06-24 18:59                                         ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman
  2021-06-24 18:59                                         ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman
@ 2021-06-24 19:00                                         ` Eric W. Biederman
  2021-06-26  3:20                                           ` Kees Cook
  2021-06-24 19:01                                         ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman
                                                           ` (6 subsequent siblings)
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Replace get_nr_threads with atomic_read(&current->signal->live) as
that is a more accurate number that is decremented sooner.
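
A minimal userspace sketch of why the two counts can disagree, assuming (as the message above says) that the live count drops early in exit while the thread count drops only when the thread is finally unlinked. The struct and function names below are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two per-thread-group counters. */
struct group_model {
	atomic_int live;	/* threads that have not started exiting */
	int nr_threads;		/* threads not yet unlinked from the group */
};

static void group_model_init(struct group_model *g, int threads)
{
	atomic_init(&g->live, threads);
	g->nr_threads = threads;
}

/* Early in exit: the thread stops counting as live. */
static void group_model_begin_exit(struct group_model *g)
{
	assert(atomic_load(&g->live) > 0);
	atomic_fetch_sub(&g->live, 1);
}

/* Late in exit: the thread is removed from the group. */
static void group_model_finish_exit(struct group_model *g)
{
	assert(g->nr_threads > 0);
	g->nr_threads--;
}

/* Models the new check: is the caller the only live thread left? */
static bool group_model_last_live(struct group_model *g)
{
	return atomic_load(&g->live) == 1;
}
```

Between group_model_begin_exit() and group_model_finish_exit() the thread count still reads 2 while the live count already reads 1, which is the window the patch is concerned with.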

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/seccomp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3e06d4628d98..5301eca670a0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1241,7 +1241,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		seccomp_log(this_syscall, SIGSYS, action, true);
 		/* Dump core only if this is the last remaining thread. */
 		if (action != SECCOMP_RET_KILL_THREAD ||
-		    get_nr_threads(current) == 1) {
+		    (atomic_read(&current->signal->live) == 1)) {
 			/* Show the original registers in the dump. */
 			syscall_rollback(current, current_pt_regs());
 			/* Trigger a coredump with SIGSYS */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 4/9] signal: Factor start_group_exit out of complete_signal
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (2 preceding siblings ...)
  2021-06-24 19:00                                         ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman
@ 2021-06-24 19:01                                         ` Eric W. Biederman
  2021-06-24 20:04                                           ` Linus Torvalds
  2021-06-26  3:24                                           ` Kees Cook
  2021-06-24 19:01                                         ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman
                                                           ` (5 subsequent siblings)
  9 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  2 ++
 kernel/signal.c              | 52 +++++++++++++++++++++++++-----------
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 774be5d3ac3e..c007e55cb119 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -428,6 +428,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 	signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
 }
 
+void start_group_exit(int exit_code);
+
 void task_join_group_stop(struct task_struct *task);
 
 #ifdef TIF_RESTORE_SIGMASK
diff --git a/kernel/signal.c b/kernel/signal.c
index da37cc4515f2..c79c010ca5f3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1027,6 +1027,42 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !task_sigpending(p);
 }
 
+static void start_group_exit_locked(struct signal_struct *signal, int exit_code)
+{
+	/*
+	 * Start a group exit and wake everybody up.
+	 * This way we don't have other threads
+	 * running and doing things after a slower
+	 * thread has the fatal signal pending.
+	 */
+	struct task_struct *t;
+
+	signal->flags = SIGNAL_GROUP_EXIT;
+	signal->group_exit_code = exit_code;
+	signal->group_stop_count = 0;
+	__for_each_thread(signal, t) {
+		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
+
+		/* Don't bother with already dead threads */
+		if (t->exit_state)
+			continue;
+		sigaddset(&t->pending.signal, SIGKILL);
+		signal_wake_up(t, 1);
+	}
+}
+
+void start_group_exit(int exit_code)
+{
+	if (!fatal_signal_pending(current)) {
+		struct sighand_struct *const sighand = current->sighand;
+
+		spin_lock_irq(&sighand->siglock);
+		if (!fatal_signal_pending(current))
+			start_group_exit_locked(current->signal, exit_code);
+		spin_unlock_irq(&sighand->siglock);
+	}
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
@@ -1076,21 +1112,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 		 * This signal will be fatal to the whole group.
 		 */
 		if (!sig_kernel_coredump(sig)) {
-			/*
-			 * Start a group exit and wake everybody up.
-			 * This way we don't have other threads
-			 * running and doing things after a slower
-			 * thread has the fatal signal pending.
-			 */
-			signal->flags = SIGNAL_GROUP_EXIT;
-			signal->group_exit_code = sig;
-			signal->group_stop_count = 0;
-			t = p;
-			do {
-				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-				sigaddset(&t->pending.signal, SIGKILL);
-				signal_wake_up(t, 1);
-			} while_each_thread(p, t);
+			start_group_exit_locked(signal, sig);
 			return;
 		}
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (3 preceding siblings ...)
  2021-06-24 19:01                                         ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman
@ 2021-06-24 19:01                                         ` Eric W. Biederman
  2021-06-26  3:35                                           ` Kees Cook
  2021-06-24 19:02                                         ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman
                                                           ` (4 subsequent siblings)
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Make thread exiting uniform by causing all threads to pass through
get_signal when they are exiting.  This simplifies the analysis
of synchronization during exit and guarantees that a full set
of registers will be available for ptrace to examine for
threads that stop at PTRACE_EVENT_EXIT.
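
The idea can be sketched as a toy userspace model: the syscall only records that an exit was requested, and the exit itself happens later on the signal-delivery path, where the full register set has been saved. All names here (task_model and the model_* helpers) are hypothetical and only illustrate the control flow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one task; not a kernel structure. */
struct task_model {
	bool exit_pending;		/* an exit has been requested */
	int exit_code;			/* wait-status style exit code */
	bool full_regs_saved;		/* true only on the signal path */
	bool exited;
	bool regs_complete_at_exit;	/* what a tracer would observe */
};

/* Same encoding as the exit_group path: low byte shifted up by 8. */
static int model_encode_exit(int error_code)
{
	return (error_code & 0xff) << 8;
}

/* Syscall context: only a subset of registers is saved, so just record. */
static void model_start_exit(struct task_model *t, int exit_code)
{
	assert(t);
	if (!t->exit_pending) {
		t->exit_pending = true;
		t->exit_code = exit_code;
	}
}

/* Return-to-user signal path: all registers are saved before we get here. */
static void model_signal_delivery(struct task_model *t)
{
	assert(t);
	t->full_regs_saved = true;
	if (t->exit_pending) {
		t->regs_complete_at_exit = t->full_regs_saved;
		t->exited = true;
	}
}
```

In this model the task is never marked as exited from syscall context, so any observer at exit time sees regs_complete_at_exit set.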

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c    | 4 ++--
 kernel/seccomp.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..921519d80b56 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -931,8 +931,8 @@ do_group_exit(int exit_code)
  */
 SYSCALL_DEFINE1(exit_group, int, error_code)
 {
-	do_group_exit((error_code & 0xff) << 8);
-	/* NOTREACHED */
+	start_group_exit((error_code & 0xff) << 8);
+	/* get_signal will call do_exit */
 	return 0;
 }
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5301eca670a0..b1c06fd1b205 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1250,7 +1250,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			if (action == SECCOMP_RET_KILL_THREAD)
 				do_exit(SIGSYS);
 			else
-				do_group_exit(SIGSYS);
+				start_group_exit(SIGSYS);
 		}
 		return -1;
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (4 preceding siblings ...)
  2021-06-24 19:01                                         ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman
@ 2021-06-24 19:02                                         ` Eric W. Biederman
  2021-06-26  3:42                                           ` Kees Cook
  2021-06-24 19:02                                         ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman
                                                           ` (3 subsequent siblings)
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Fold do_group_exit into get_signal as it is the last caller.

Move the group_exit logic above the PF_IO_WORKER exit, ensuring
that if a PF_IO_WORKER catches SIGKILL every thread in
the thread group will exit, not just the PF_IO_WORKER.

Now that the information is easily available, only set PF_SIGNALED
when it was a signal that caused the exit.
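
A small sketch of the PF_SIGNALED test, assuming the usual wait-status layout in which exit(2) places its value in bits 8..15 and a fatal signal number occupies the low 7 bits. The helper names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Exit code produced by exit(2)/exit_group(2): value in bits 8..15. */
static int model_exit_status(int error_code)
{
	return (error_code & 0xff) << 8;
}

/* Mirrors the "exit_code & 0x7f" test: non-zero low bits mean a signal. */
static bool model_exit_was_signal(int exit_code)
{
	return (exit_code & 0x7f) != 0;
}

/* A few illustrative checks. */
static void model_pf_signaled_examples(void)
{
	assert(model_exit_was_signal(SIGKILL));
	assert(!model_exit_was_signal(model_exit_status(1)));
}
```

With this layout a plain exit never sets the low 7 bits, so the flag is only set for signal-caused exits, with or without the 0x80 core-dump bit.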

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/task.h |  1 -
 kernel/exit.c              | 31 -------------------------------
 kernel/signal.c            | 35 +++++++++++++++++++++++++----------
 3 files changed, 25 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..45525512e3d0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk)
 {
 }
 #endif
-extern void do_group_exit(int);
 
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 921519d80b56..635f434122b7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code)
 	do_exit((error_code&0xff)<<8);
 }
 
-/*
- * Take down every thread in the group.  This is called by fatal signals
- * as well as by sys_exit_group (below).
- */
-void
-do_group_exit(int exit_code)
-{
-	struct signal_struct *sig = current->signal;
-
-	BUG_ON(exit_code & 0x80); /* core dumps don't get here */
-
-	if (signal_group_exit(sig))
-		exit_code = sig->group_exit_code;
-	else if (!thread_group_empty(current)) {
-		struct sighand_struct *const sighand = current->sighand;
-
-		spin_lock_irq(&sighand->siglock);
-		if (signal_group_exit(sig))
-			/* Another thread got here before we took the lock.  */
-			exit_code = sig->group_exit_code;
-		else {
-			sig->group_exit_code = exit_code;
-			sig->flags = SIGNAL_GROUP_EXIT;
-			zap_other_threads(current);
-		}
-		spin_unlock_irq(&sighand->siglock);
-	}
-
-	do_exit(exit_code);
-	/* NOTREACHED */
-}
 
 /*
  * this kills every thread in the thread group. Note that any externally
diff --git a/kernel/signal.c b/kernel/signal.c
index c79c010ca5f3..95a076af600a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig)
 {
 	struct sighand_struct *sighand = current->sighand;
 	struct signal_struct *signal = current->signal;
+	int exit_code;
 	int signr;
 
 	if (unlikely(current->task_works))
@@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig)
 		/*
 		 * Anything else is fatal, maybe with a core dump.
 		 */
-		current->flags |= PF_SIGNALED;
-
 		if (sig_kernel_coredump(signr)) {
 			if (print_fatal_signals)
 				print_fatal_signal(ksig->info.si_signo);
@@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig)
 			/*
 			 * If it was able to dump core, this kills all
 			 * other threads in the group and synchronizes with
-			 * their demise.  If we lost the race with another
-			 * thread getting here, it set group_exit_code
-			 * first and our do_group_exit call below will use
-			 * that value and ignore the one we pass it.
+			 * their demise.  If  another thread makes it
+			 * to do_coredump first, it will set group_exit_code
+			 * which will be passed to do_exit.
 			 */
 			do_coredump(&ksig->info);
 		}
 
+		/*
+		 * Death signals, no core dump.
+		 */
+		exit_code = signr;
+		if (signal_group_exit(signal)) {
+			exit_code = signal->group_exit_code;
+		} else {
+			spin_lock_irq(&sighand->siglock);
+			if (signal_group_exit(signal)) {
+				/* Another thread got here before we took the lock.  */
+				exit_code = signal->group_exit_code;
+			} else {
+				start_group_exit_locked(signal, exit_code);
+			}
+			spin_unlock_irq(&sighand->siglock);
+		}
+
+		if (exit_code & 0x7f)
+			current->flags |= PF_SIGNALED;
+
 		/*
 		 * PF_IO_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
@@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig)
 		if (current->flags & PF_IO_WORKER)
 			goto out;
 
-		/*
-		 * Death signals, no core dump.
-		 */
-		do_group_exit(ksig->info.si_signo);
+		do_exit(exit_code);
 		/* NOTREACHED */
 	}
 	spin_unlock_irq(&sighand->siglock);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 7/9] signal: Make individual tasks exiting a first class concept.
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (5 preceding siblings ...)
  2021-06-24 19:02                                         ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman
@ 2021-06-24 19:02                                         ` Eric W. Biederman
  2021-06-24 20:11                                           ` Linus Torvalds
  2021-06-24 19:03                                         ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman
                                                           ` (2 subsequent siblings)
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Implement start_task_exit_locked and rewrite the de_thread logic
in exec using it.

Calling start_task_exit_locked is equivalent to asynchronously
calling exit(2) aka pthread_exit on a task.
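
A hedged sketch of how the exit code can share a word with the exiting flag, as task_set_jobctl_exiting does in the diff below. The MODEL_* constants are local stand-ins: bit 31 matches the patch, while the 0xffff mask is an assumption about the value of JOBCTL_STOP_SIGMASK.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins; the 0xffff mask value is an assumption. */
#define MODEL_STOP_SIGMASK	0xffffUL
#define MODEL_TASK_EXITING	(1UL << 31)

/* Pack the exiting flag and the exit code into one word. */
static unsigned long model_pack_exiting(int exit_code)
{
	return MODEL_TASK_EXITING |
	       ((unsigned long)exit_code & MODEL_STOP_SIGMASK);
}

static bool model_is_exiting(unsigned long jobctl)
{
	return (jobctl & MODEL_TASK_EXITING) != 0;
}

/* Recover the exit code; only meaningful once the flag is set. */
static int model_exit_code(unsigned long jobctl)
{
	assert(model_is_exiting(jobctl));
	return (int)(jobctl & MODEL_STOP_SIGMASK);
}
```

Under that assumption both a signal number and an exit(2) status of the form (code & 0xff) << 8 fit in the masked field without loss.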

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 10 +++++++++-
 include/linux/sched/jobctl.h |  2 ++
 include/linux/sched/signal.h |  1 +
 kernel/signal.c              | 37 ++++++++++++++++--------------------
 4 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..b6f50213f0a0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1040,6 +1040,7 @@ static int de_thread(struct task_struct *tsk)
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t;
 
 	if (thread_group_empty(tsk))
 		goto no_thread_group;
@@ -1058,7 +1059,14 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	sig->group_exit_task = tsk;
-	sig->notify_count = zap_other_threads(tsk);
+	sig->group_stop_count = 0;
+	sig->notify_count = 0;
+	__for_each_thread(sig, t) {
+		if (t == tsk)
+			continue;
+		sig->notify_count++;
+		start_task_exit_locked(t, SIGKILL);
+	}
 	if (!thread_group_leader(tsk))
 		sig->notify_count--;
 
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..e94833b0c819 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_TASK_EXITING_BIT 31	/* the task is exiting */
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_TASK_EXITING	(1UL << JOBCTL_TASK_EXITING_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index c007e55cb119..a958381ba4a9 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -429,6 +429,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 }
 
 void start_group_exit(int exit_code);
+void start_task_exit_locked(struct task_struct *task, int exit_code);
 
 void task_join_group_stop(struct task_struct *task);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 95a076af600a..afbc001220dd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -264,6 +264,12 @@ static inline void print_dropped_signal(int sig)
 				current->comm, current->pid, sig);
 }
 
+static void task_set_jobctl_exiting(struct task_struct *task, int exit_code)
+{
+	WARN_ON_ONCE(task->jobctl & ~JOBCTL_STOP_SIGMASK);
+	task->jobctl = JOBCTL_TASK_EXITING | (exit_code & JOBCTL_STOP_SIGMASK);
+}
+
 /**
  * task_set_jobctl_pending - set jobctl pending bits
  * @task: target task
@@ -1407,28 +1413,15 @@ int force_sig_info(struct kernel_siginfo *info)
 	return force_sig_info_to_task(info, current, false);
 }
 
-/*
- * Nuke all other threads in the group.
- */
-int zap_other_threads(struct task_struct *p)
+void start_task_exit_locked(struct task_struct *task, int exit_code)
 {
-	struct task_struct *t = p;
-	int count = 0;
-
-	p->signal->group_stop_count = 0;
-
-	while_each_thread(p, t) {
-		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		count++;
-
-		/* Don't bother with already dead threads */
-		if (t->exit_state)
-			continue;
-		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+	task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
+	/* Only bother with threads that might be alive */
+	if (!task->exit_state) {
+		task_set_jobctl_exiting(task, exit_code);
+		sigaddset(&task->pending.signal, SIGKILL);
+		signal_wake_up(task, 1);
 	}
-
-	return count;
 }
 
 struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
@@ -2714,7 +2707,7 @@ bool get_signal(struct ksignal *ksig)
 	}
 
 	/* Has this task already been marked for death? */
-	if (signal_group_exit(signal)) {
+	if (signal_group_exit(signal) || (current->jobctl & JOBCTL_TASK_EXITING)) {
 		ksig->info.si_signo = signr = SIGKILL;
 		sigdelset(&current->pending.signal, SIGKILL);
 		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
@@ -2874,6 +2867,8 @@ bool get_signal(struct ksignal *ksig)
 			if (signal_group_exit(signal)) {
 				/* Another thread got here before we took the lock.  */
 				exit_code = signal->group_exit_code;
+			} else if (current->jobctl & JOBCTL_TASK_EXITING) {
+				exit_code = current->jobctl & JOBCTL_STOP_SIGMASK;
 			} else {
 				start_group_exit_locked(signal, exit_code);
 			}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (6 preceding siblings ...)
  2021-06-24 19:02                                         ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman
@ 2021-06-24 19:03                                         ` Eric W. Biederman
  2021-06-26  5:56                                           ` Kees Cook
  2021-06-24 19:03                                         ` [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal Eric W. Biederman
  2021-06-24 22:45                                         ` [PATCH 0/9] Refactoring exit Al Viro
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


Reuse start_task_exit_locked to implement start_task_exit.

Simplify the exit logic by having all exits go through get_signal.
This simplifies the analysis of synchronization during exit and
guarantees a full set of registers will be available for ptrace to
examine at PTRACE_EVENT_EXIT.
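
The check, lock, re-check shape used by start_task_exit in the diff below can be sketched in userspace as follows. This is only a model: an atomic_flag spin loop stands in for siglock, a boolean stands in for fatal_signal_pending(), and every name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; the flag plays the role of siglock. */
struct exit_model {
	atomic_flag lock;
	atomic_bool fatal_pending;
	int exit_code;
};

static void exit_model_init(struct exit_model *m)
{
	atomic_flag_clear(&m->lock);
	atomic_init(&m->fatal_pending, false);
	m->exit_code = 0;
}

static void exit_model_lock(struct exit_model *m)
{
	while (atomic_flag_test_and_set_explicit(&m->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void exit_model_unlock(struct exit_model *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

/* Returns true only for the caller that actually started the exit. */
static bool exit_model_start(struct exit_model *m, int exit_code)
{
	bool started = false;

	assert(m);
	if (!atomic_load(&m->fatal_pending)) {		/* cheap unlocked check */
		exit_model_lock(m);
		if (!atomic_load(&m->fatal_pending)) {	/* re-check under lock */
			m->exit_code = exit_code;
			atomic_store(&m->fatal_pending, true);
			started = true;
		}
		exit_model_unlock(m);
	}
	return started;
}
```

The unlocked check avoids taking the lock once an exit is already pending; the second check under the lock closes the race with another requester that got in between.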

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  1 +
 kernel/exit.c                |  4 +++-
 kernel/seccomp.c             |  2 +-
 kernel/signal.c              | 12 ++++++++++++
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a958381ba4a9..3f4e69c019b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -430,6 +430,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 
 void start_group_exit(int exit_code);
 void start_task_exit_locked(struct task_struct *task, int exit_code);
+void start_task_exit(int exit_code);
 
 void task_join_group_stop(struct task_struct *task);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 635f434122b7..51e0c82b3f7d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);
 
 SYSCALL_DEFINE1(exit, int, error_code)
 {
-	do_exit((error_code&0xff)<<8);
+	start_task_exit((error_code&0xff)<<8);
+	/* get_signal will call do_exit */
+	return 0;
 }
 
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b1c06fd1b205..e0c4c123a8bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1248,7 +1248,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			force_sig_seccomp(this_syscall, data, true);
 		} else {
 			if (action == SECCOMP_RET_KILL_THREAD)
-				do_exit(SIGSYS);
+				start_task_exit(SIGSYS);
 			else
 				start_group_exit(SIGSYS);
 		}
diff --git a/kernel/signal.c b/kernel/signal.c
index afbc001220dd..63fda9b6bbf9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1424,6 +1424,18 @@ void start_task_exit_locked(struct task_struct *task, int exit_code)
 	}
 }
 
+void start_task_exit(int exit_code)
+{
+	struct task_struct *task = current;
+	if (!fatal_signal_pending(task)) {
+		struct sighand_struct *const sighand = task->sighand;
+		spin_lock_irq(&sighand->siglock);
+		if (!fatal_signal_pending(current))
+			start_task_exit_locked(task, exit_code);
+		spin_unlock_irq(&sighand->siglock);
+	}
+}
+
 struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 					   unsigned long *flags)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (7 preceding siblings ...)
  2021-06-24 19:03                                         ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman
@ 2021-06-24 19:03                                         ` Eric W. Biederman
  2021-06-24 22:45                                         ` [PATCH 0/9] Refactoring exit Al Viro
  9 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook


This ensures that we always have a full set of registers available when
PTRACE_EVENT_EXIT is called, something that is not guaranteed for callers
of do_exit.

Additionally this guarantees PTRACE_EVENT_EXIT will not cause havoc
with abnormal exits.
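
For context, a tracer recognizes this stop from the waitpid(2) status. The sketch below assumes the encoding described in ptrace(2), where status >> 8 equals SIGTRAP | (PTRACE_EVENT_EXIT << 8). The event number is defined locally as 6 so the snippet does not depend on the ptrace headers, and the helper names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Local copy of the event number; assumed to match PTRACE_EVENT_EXIT. */
#define MODEL_PTRACE_EVENT_EXIT 6

/* Build the wait status a tracer would see for an exit-event stop. */
static int model_make_exit_event_status(void)
{
	return ((SIGTRAP | (MODEL_PTRACE_EVENT_EXIT << 8)) << 8) | 0x7f;
}

/* True if a stopped-tracee status reports the exit event. */
static bool model_status_is_exit_event(int status)
{
	return (status >> 8) == (SIGTRAP | (MODEL_PTRACE_EVENT_EXIT << 8));
}

static void model_exit_event_examples(void)
{
	assert(model_status_is_exit_event(model_make_exit_event_status()));
}
```

A real tracer would then read the tracee's exit status with PTRACE_GETEVENTMSG and, on the affected architectures, could safely fetch registers at this stop once the series is applied.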

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c   | 2 --
 kernel/signal.c | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 51e0c82b3f7d..309f1d71e340 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -763,8 +763,6 @@ void __noreturn do_exit(long code)
 	profile_task_exit(tsk);
 	kcov_task_exit(tsk);
 
-	ptrace_event(PTRACE_EVENT_EXIT, code);
-
 	validate_creds_for_do_exit(tsk);
 
 	/*
diff --git a/kernel/signal.c b/kernel/signal.c
index 63fda9b6bbf9..7214331836bc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2890,6 +2890,8 @@ bool get_signal(struct ksignal *ksig)
 		if (exit_code & 0x7f)
 			current->flags |= PF_SIGNALED;
 
+		ptrace_event(PTRACE_EVENT_EXIT, exit_code);
+
 		/*
 		 * PF_IO_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal
  2021-06-24 19:01                                         ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman
@ 2021-06-24 20:04                                           ` Linus Torvalds
  2021-06-26  3:24                                           ` Kees Cook
  1 sibling, 0 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-24 20:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

I don't really mind the patch, but this patch doesn't actually do what
it says it does.

It factors out start_group_exit_locked() - which all looks good.

But then it also creates that new start_group_exit() function and
makes the declaration for it, and nothing actually uses it. Yet.

I'd do that second part later when you actually introduce the use in
the next patch (5/9).

Hmm?

           Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.
  2021-06-24 19:02                                         ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman
@ 2021-06-24 20:11                                           ` Linus Torvalds
  2021-06-24 21:37                                             ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-24 20:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Thu, Jun 24, 2021 at 12:03 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Implement start_task_exit_locked and rewrite the de_thread logic
> in exec using it.
>
> Calling start_task_exit_locked is equivalent to asynchronously
> calling exit(2) aka pthread_exit on a task.

Ok, so this is the patch that makes me go "Yeah, this seems to all go together".

The whole "start_exit()" thing seemed fairly sane as an interesting
concept to the whole ptrace notification thing, but this one actually
made me think it makes conceptual sense and how we had exactly that
"start exit asynchronously" case already in zap_other_threads().

So doing that zap_other_threads() as that async exit makes me just
think that yes, this series is the right thing, because it not only
cleans up the ptrace condition, it makes sense in this entirely
unrelated area too.

So I think I'm convinced. I'd like Oleg in particular to Ack this
series, and Al to look it over, but I do think this is the right
direction.

           Linus

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.
  2021-06-24 20:11                                           ` Linus Torvalds
@ 2021-06-24 21:37                                             ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 21:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 24, 2021 at 12:03 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Implement start_task_exit_locked and rewrite the de_thread logic
>> in exec using it.
>>
>> Calling start_task_exit_locked is equivalent to asynchronously
>> calling exit(2) aka pthread_exit on a task.
>
> Ok, so this is the patch that makes me go "Yeah, this seems to all go together".
>
> The whole "start_exit()" thing seemed fairly sane as an interesting
> concept to the whole ptrace notification thing, but this one actually
> made me think it makes conceptual sense and how we had exactly that
> "start exit asynchronously" case already in zap_other_threads().
>
> So doing that zap_other_threads() as that async exit makes me just
> think that yes, this series is the right thing, because it not only
> cleans up the ptrace condition, it makes sense in this entirely
> unrelated area too.
>
> So I think I'm convinced. I'd like Oleg in particular to Ack this
> series, and Al to look it over, but I do think this is the right
> direction.

Thanks.

It took a bit of exploration and playing with things to get here,
but I had the same sense.

Next round I will see if I can clean up the patches a bit more.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
                                                           ` (8 preceding siblings ...)
  2021-06-24 19:03                                         ` [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal Eric W. Biederman
@ 2021-06-24 22:45                                         ` Al Viro
  2021-06-27 22:13                                           ` Al Viro
  2021-06-28 19:02                                           ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  9 siblings, 2 replies; 119+ messages in thread
From: Al Viro @ 2021-06-24 22:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Thu, Jun 24, 2021 at 01:57:35PM -0500, Eric W. Biederman wrote:

> So far the code has been lightly tested, and the descriptions of some
> of the patches are a bit light, but I think this shows the direction
> I am aiming to travel for sorting out exit(2) and exit_group(2).

FWIW, here's the current picture for do_exit(), aside from exit(2) and do_group_exit():

1) stuff that is clearly oops-like -
        alpha:die_if_kernel() alpha:do_entUna() alpha:do_page_fault() arm:oops_end()
        arm:__do_kernel_fault() arm64:die() arm64:die_kernel_fault() csky:alignment()
        csky:die() csky:no_context() h8300:die() h8300:do_page_fault() hexagon:die()
        ia64:die() ia64:ia64_do_page_fault() m68k:die_if_kernel() m68k:send_fault_sig()
        microblaze:die() mips:die() nds32:handle_fpu_exception() nds32:die()
        nds32:unhandled_interruption() nds32:unhandled_exceptions() nds32:do_revinsn()
        nds32:do_page_fault() nios2:die() openrisc:die() openrisc:do_page_fault()
        parisc:die_if_kernel() ppc:oops_end() riscv:die() riscv:die_kernel_fault()
        s390:die() s390:do_no_context() s390:do_low_address() sh:die()
        sparc32:die_if_kernel() sparc32:do_sparc_fault() sparc64:die_if_kernel()
        x86:rewind_stack_do_exit() xtensa:die() xtensa:bad_page_fault()
We really do not want ptrace anywhere near any of those and we do not want
any of that to return; this shit would better be handled right then and
there - no "post a fatal signal" would do.

2) sparc32 playing silly buggers with SIGILL in case when signal delivery
can't get a valid sigframe.  The regular variant for that kind of stuff
is forced SIGSEGV from failure case of signal_setup_done().  We could force
that SIGILL instead of do_exit() there (and report failure from sigframe
setup), but I suspect that we'll get SIGSEGV override that SIGILL, with
user-visible behaviour change.  Triggered by altstack overflow on sparc32;
sparc64 gets SIGSEGV in the same situation, just like everybody else.

3) ppc swapcontext(2).  Normal syscall, on failure results in exit(SIGSEGV).
Not sure if we want to post signal here - exposing the caller to results
of failure might be... interesting.  And I really don't know if we want
to allow ptrace() to poke around in the results of such failure.  That's
a question for ppc maintainers.

4) sparc32:try_to_clear_window_buffer().  Probably could force SIGSEGV
instead of do_exit() there, but that might need a bit of massage in
asm glue - it's called on the way out of kernel, right before handling
signals.  I'd like comments from davem on that one, though.

5) in xtensa fast_syscall_spill_registers() stuff.  Might or might not
be similar to the above.

6) sparc64 in tsb_grow() - looks like "impossible case, kill the sucker
dead if that ever happens".  Not sure if it's reachable at all.

7) s390 copy_thread() is doing something interesting in kernel thread
case - frame->childregs.gprs[11] = (unsigned long)do_exit;
AFAICS, had been unused since 30dcb0996e40, when s390 switched to new
kernel_execve() semantics and kernel_thread_starter stopped using r11
(or proceeding to do_exit() in the first place).  Ought to be removed,
if s390 folks ACK that.

8) x86:emulate_vsyscall(), x86:save_v86_state(), m68k:fpsp040_die(),
mips:bad_stack(), s390:__s390_handle_mcck(), ia64:mca_handler_bh(),
s390:default_trap_handler() - fuck knows.

9) seccomp stuff - this one should *NOT* be switched to posting signals;
it's on syscall_trace_enter() paths and we'd better have signal-equivalent
environment there.  We sure as hell do have regular "stop and let tracer
poke around" in the same area - that's where strace is poking around.

10) there's a (moderate) bunch of places all over the tree where we
have kthread() payload hit do_exit(), with or without complete() or
module_put().  No ptrace stuff is going to be hit there and I see no
point in switching those to posting anything.  In particular,
module_put_and_exit() sure as hell does *NOT* want to return to caller -
it might've been unmapped by the point we are done.  This do_exit()
should really be noreturn.

11) abuses in kernel/kthread.c; AFAICS, it's misused as a mechanism
to return an error value to parent.  No ptrace possible (parent
definitely not traced) and I don't see any point in delaying the
handling of that do_exit() either (same as with the execve failure
in call_usermodehelper_exec_async()).

12) io-uring threads hitting do_exit().  These, apparently, can be
ptraced...

13) there's bdflush(1, whatever), which is equivalent to exit(0).
IMO it's long past the time to simply remove the sucker.

14) reboot(2) stuff.  No idea.

15) syscall_user_dispatch().  Didn't have time to look through that
stuff in details yet, so no idea at the moment.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation
  2021-06-24 18:59                                         ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman
@ 2021-06-26  3:17                                           ` Kees Cook
  2021-06-28 19:21                                             ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Kees Cook @ 2021-06-26  3:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo

On Thu, Jun 24, 2021 at 01:59:55PM -0500, Eric W. Biederman wrote:
> 
> Factor out force_sig_seccomp from the seccomp signal generation and
> place it in kernel/signal.c.  The function force_sig_seccomp takes a
> parameter force_coredump to indicate that the sigaction field should be
> reset to SIG_DFL so that a coredump will be generated when the signal
> is delivered.

Ah! This is the part I missed when I was originally trying to figure
out the coredump stuff. It's the need for setting a default handler
(i.e. doing a coredump)?

> force_sig_seccomp is then used to replace both seccomp_send_sigsys
> and seccomp_init_siginfo.
> 
> force_sig_info_to_task gains an extra parameter to force using
> the default signal action.
> 
> With this change seccomp is no longer a special case and there
> becomes exactly one place do_coredump is called from.

Looks good to me. This may benefit from wrapping force_sig_seccomp()
in an #ifdef CONFIG_SECCOMP.

(This patch reminds me that the seccomp self tests don't check for core
dumps...)
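
A minimal userspace sketch of the kind of check such a self test could
make, assuming a plain fork()/waitpid() harness rather than the real
kselftest macros. The child restores the default SIGSYS disposition and
raises the signal; the parent inspects the wait status. The helper name
child_dies_from() is hypothetical, and the core limit is set to zero so
the sketch writes no core file; a real test would raise the limit and
also look at WCOREDUMP().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that kills itself with sig under the
 * default disposition, and report whether the parent observed that signal.
 */
static bool child_dies_from(int sig)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return false;
	if (pid == 0) {
		struct rlimit no_core = { 0, 0 };
		sigset_t set;

		/* Keep the sketch from leaving core files behind. */
		setrlimit(RLIMIT_CORE, &no_core);
		/* Default action, and make sure the signal is not blocked. */
		signal(sig, SIG_DFL);
		sigemptyset(&set);
		sigaddset(&set, sig);
		sigprocmask(SIG_UNBLOCK, &set, NULL);
		raise(sig);
		_exit(0);	/* only reached if the signal did not kill us */
	}
	if (waitpid(pid, &status, 0) != pid)
		return false;
	return WIFSIGNALED(status) && WTERMSIG(status) == sig;
}
```

Calling child_dies_from(SIGSYS) from a test body is enough to confirm
that the default action terminates the child with that signal.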

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread
  2021-06-24 19:00                                         ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman
@ 2021-06-26  3:20                                           ` Kees Cook
  0 siblings, 0 replies; 119+ messages in thread
From: Kees Cook @ 2021-06-26  3:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo

On Thu, Jun 24, 2021 at 02:00:22PM -0500, Eric W. Biederman wrote:
> Replace get_nr_threads with atomic_read(&current->signal->live) as
> that is a more accurate number that is decremented sooner.

Okay, seems fine to me. :)
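
To illustrate why the two counters can disagree, here is a simplified
userspace model, not kernel code. It assumes that signal->live is
dropped early in do_exit() while nr_threads is only dropped once the
task is released; the struct and function names are made up for the
illustration.

```c
#include <stdatomic.h>

/* Toy stand-in for the two per-thread-group counters. */
struct model_counts {
	atomic_int live;	/* models signal->live */
	int nr_threads;		/* models signal->nr_threads */
};

/* Models the early part of do_exit(): the thread stops counting as live. */
static void model_do_exit(struct model_counts *c)
{
	atomic_fetch_sub(&c->live, 1);
}

/* Models the later release of the task: nr_threads catches up. */
static void model_release_task(struct model_counts *c)
{
	c->nr_threads--;
}

/* The check the patch switches to: is the caller the only live thread? */
static int model_only_live_thread(struct model_counts *c)
{
	return atomic_load(&c->live) == 1;
}
```

Between model_do_exit() and model_release_task() the model reports one
live thread while nr_threads still says two, which is the window the
commit message refers to.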

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal
  2021-06-24 19:01                                         ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman
  2021-06-24 20:04                                           ` Linus Torvalds
@ 2021-06-26  3:24                                           ` Kees Cook
  1 sibling, 0 replies; 119+ messages in thread
From: Kees Cook @ 2021-06-26  3:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo

On Thu, Jun 24, 2021 at 02:01:20PM -0500, Eric W. Biederman wrote:
> +static void start_group_exit_locked(struct signal_struct *signal, int exit_code)
> +{
> +	/*
> +	 * Start a group exit and wake everybody up.
> +	 * This way we don't have other threads
> +	 * running and doing things after a slower
> +	 * thread has the fatal signal pending.
> +	 */
> +	struct task_struct *t;
> +
> +	signal->flags = SIGNAL_GROUP_EXIT;
> +	signal->group_exit_code = exit_code;
> +	signal->group_stop_count = 0;
> +	__for_each_thread(signal, t) {
> +		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> +
> +		/* Don't bother with already dead threads */
> +		if (t->exit_state)
> +			continue;
> +		sigaddset(&t->pending.signal, SIGKILL);
> +		signal_wake_up(t, 1);
> +	}

This both extracts it and changes it. For ease-of-review, maybe split
this patch into the move and then the logic changes?
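
For readers skimming the thread, the loop in the quoted function can be
modelled in userspace roughly as below. This is a sketch under stated
assumptions, not the kernel implementation: the structs are toy
stand-ins, locking and the jobctl handling are left out, and every
model_* name is hypothetical.

```c
#include <stddef.h>

#define MODEL_SIGKILL 9

struct model_thread {
	int exit_state;		/* non-zero once the thread is already dead */
	unsigned long pending;	/* bitmask of pending signals */
	int woken;		/* set where the kernel would call signal_wake_up() */
};

struct model_group {
	int group_exit;		/* stands in for SIGNAL_GROUP_EXIT */
	int group_exit_code;
	struct model_thread *threads;
	size_t nr;
};

/* Mark the group as exiting, then queue SIGKILL for every thread still alive. */
static void model_start_group_exit(struct model_group *g, int exit_code)
{
	g->group_exit = 1;
	g->group_exit_code = exit_code;
	for (size_t i = 0; i < g->nr; i++) {
		struct model_thread *t = &g->threads[i];

		/* Don't bother with already dead threads. */
		if (t->exit_state)
			continue;
		t->pending |= 1UL << (MODEL_SIGKILL - 1);
		t->woken = 1;
	}
}
```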

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit
  2021-06-24 19:01                                         ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman
@ 2021-06-26  3:35                                           ` Kees Cook
  0 siblings, 0 replies; 119+ messages in thread
From: Kees Cook @ 2021-06-26  3:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo

On Thu, Jun 24, 2021 at 02:01:40PM -0500, Eric W. Biederman wrote:
> 
> Make thread exiting uniform by causing all threads to pass through
> get_signal when they are exiting.  This simplifies the analysis
> of synchronization during exit and guarantees that a full set
> of registers will be available for ptrace to examine for
> threads that stop at PTRACE_EVENT_EXIT.

Yeah, cool. I do like making the process lifetime more sensible here. It
always threw me that do_exit*() just stopped execution. :)

For future me, can you add a comment on start_group_exit() that mentions
where final process death happens?

> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/exit.c    | 4 ++--
>  kernel/seccomp.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd1c04193e18..921519d80b56 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -931,8 +931,8 @@ do_group_exit(int exit_code)
>   */
>  SYSCALL_DEFINE1(exit_group, int, error_code)
>  {
> -	do_group_exit((error_code & 0xff) << 8);
> -	/* NOTREACHED */
> +	start_group_exit((error_code & 0xff) << 8);
> +	/* get_signal will call do_exit */
>  	return 0;

"0" feels weird here, but I can't think of any better "fail closed"
error code here.

>  }
>  
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 5301eca670a0..b1c06fd1b205 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1250,7 +1250,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			if (action == SECCOMP_RET_KILL_THREAD)
>  				do_exit(SIGSYS);
>  			else
> -				do_group_exit(SIGSYS);
> +				start_group_exit(SIGSYS);

This could use a similar comment to the syscall's comment, just so I
don't panic when I read this code in like 3 years. ;)

Otherwise, yeah, looks good.

-Kees

>  		}
>  		return -1;
>  	}
> -- 
> 2.20.1
> 
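
As a reminder of what the (error_code & 0xff) << 8 in the quoted hunk
encodes, here is a small userspace sketch of the usual wait(2) status
layout: the exit status sits in bits 8-15 and the terminating signal,
if any, in the low 7 bits. The helper names are hypothetical and exist
only for this illustration.

```c
/* Build the internal exit code from the value passed to exit(2)/exit_group(2). */
static int exit_code_from_syscall(int error_code)
{
	return (error_code & 0xff) << 8;
}

/* Exit status as a parent would see it via WEXITSTATUS(). */
static int exit_status_of(int exit_code)
{
	return (exit_code >> 8) & 0xff;
}

/* Terminating signal number, or 0 if the task exited voluntarily. */
static int term_signal_of(int exit_code)
{
	return exit_code & 0x7f;
}
```

An exit code built by exit_code_from_syscall() always has a zero
signal field, so it cannot be mistaken for death by signal.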

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads
  2021-06-24 19:02                                         ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman
@ 2021-06-26  3:42                                           ` Kees Cook
  2021-06-28 19:25                                             ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Kees Cook @ 2021-06-26  3:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Tejun Heo

On Thu, Jun 24, 2021 at 02:02:16PM -0500, Eric W. Biederman wrote:
> 
> Fold do_group_exit into get_signal as it is the last caller.
> 
> Move the group_exit logic above the PF_IO_WORKER exit, ensuring
> that if a PF_IO_WORKER catches SIGKILL every thread in
> the thread group will exit, not just the PF_IO_WORKER.
> 
> Now that the information is easily available only set PF_SIGNALED
> when it was a signal that caused the exit.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched/task.h |  1 -
>  kernel/exit.c              | 31 -------------------------------
>  kernel/signal.c            | 35 +++++++++++++++++++++++++----------
>  3 files changed, 25 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ef02be869cf2..45525512e3d0 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk)
>  {
>  }
>  #endif
> -extern void do_group_exit(int);
>  
>  extern void exit_files(struct task_struct *);
>  extern void exit_itimers(struct signal_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 921519d80b56..635f434122b7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code)
>  	do_exit((error_code&0xff)<<8);
>  }
>  
> -/*
> - * Take down every thread in the group.  This is called by fatal signals
> - * as well as by sys_exit_group (below).
> - */
> -void
> -do_group_exit(int exit_code)
> -{
> -	struct signal_struct *sig = current->signal;
> -
> -	BUG_ON(exit_code & 0x80); /* core dumps don't get here */
> -
> -	if (signal_group_exit(sig))
> -		exit_code = sig->group_exit_code;
> -	else if (!thread_group_empty(current)) {
> -		struct sighand_struct *const sighand = current->sighand;
> -
> -		spin_lock_irq(&sighand->siglock);
> -		if (signal_group_exit(sig))
> -			/* Another thread got here before we took the lock.  */
> -			exit_code = sig->group_exit_code;
> -		else {
> -			sig->group_exit_code = exit_code;
> -			sig->flags = SIGNAL_GROUP_EXIT;
> -			zap_other_threads(current);

Oh, now I see it: the "new code" in start_group_exit() is an open-coded
zap_other_threads()? That wasn't clear to me, but makes sense now.

> -		}
> -		spin_unlock_irq(&sighand->siglock);
> -	}
> -
> -	do_exit(exit_code);
> -	/* NOTREACHED */
> -}
>  
>  /*
>   * this kills every thread in the thread group. Note that any externally
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c79c010ca5f3..95a076af600a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig)
>  {
>  	struct sighand_struct *sighand = current->sighand;
>  	struct signal_struct *signal = current->signal;
> +	int exit_code;
>  	int signr;
>  
>  	if (unlikely(current->task_works))
> @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig)
>  		/*
>  		 * Anything else is fatal, maybe with a core dump.
>  		 */
> -		current->flags |= PF_SIGNALED;
> -
>  		if (sig_kernel_coredump(signr)) {
>  			if (print_fatal_signals)
>  				print_fatal_signal(ksig->info.si_signo);
> @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig)
>  			/*
>  			 * If it was able to dump core, this kills all
>  			 * other threads in the group and synchronizes with
> -			 * their demise.  If we lost the race with another
> -			 * thread getting here, it set group_exit_code
> -			 * first and our do_group_exit call below will use
> -			 * that value and ignore the one we pass it.
> +			 * their demise.  If  another thread makes it
> +			 * to do_coredump first, it will set group_exit_code
> +			 * which will be passed to do_exit.
>  			 */
>  			do_coredump(&ksig->info);
>  		}
>  
> +		/*
> +		 * Death signals, no core dump.
> +		 */
> +		exit_code = signr;
> +		if (signal_group_exit(signal)) {
> +			exit_code = signal->group_exit_code;
> +		} else {
> +			spin_lock_irq(&sighand->siglock);
> +			if (signal_group_exit(signal)) {
> +				/* Another thread got here before we took the lock.  */
> +				exit_code = signal->group_exit_code;
> +			} else {
> +				start_group_exit_locked(signal, exit_code);

And here's the "if we didn't already do start_group_exit(), do it here".
And that state is entirely captured via the SIGNAL_GROUP_EXIT flag.
Cool.

> +			}
> +			spin_unlock_irq(&sighand->siglock);
> +		}
> +
> +		if (exit_code & 0x7f)
> +			current->flags |= PF_SIGNALED;
> +
>  		/*
>  		 * PF_IO_WORKER threads will catch and exit on fatal signals
>  		 * themselves. They have cleanup that must be performed, so
> @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>  		if (current->flags & PF_IO_WORKER)
>  			goto out;
>  
> -		/*
> -		 * Death signals, no core dump.
> -		 */
> -		do_group_exit(ksig->info.si_signo);
> +		do_exit(exit_code);
>  		/* NOTREACHED */
>  	}
>  	spin_unlock_irq(&sighand->siglock);
> -- 
> 2.20.1
> 
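
The check, lock, re-check sequence in the get_signal hunk above can be
sketched in userspace as follows. This is a model under assumptions,
not the kernel code: a C11 atomic_flag spinlock stands in for
sighand->siglock, an atomic_bool stands in for the SIGNAL_GROUP_EXIT
flag, and all model_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_sig_state {
	atomic_flag lock;	/* stands in for sighand->siglock */
	atomic_bool group_exit;	/* stands in for SIGNAL_GROUP_EXIT */
	int group_exit_code;
};

static void model_lock(struct model_sig_state *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin until the lock is free */
}

static void model_unlock(struct model_sig_state *s)
{
	atomic_flag_clear(&s->lock);
}

/*
 * Return the exit code this thread should die with. The first caller
 * starts the group exit and records its code; later callers adopt it.
 */
static int model_pick_exit_code(struct model_sig_state *s, int exit_code)
{
	/* Fast path: a group exit is already in progress. */
	if (atomic_load(&s->group_exit))
		return s->group_exit_code;

	model_lock(s);
	if (atomic_load(&s->group_exit)) {
		/* Another thread got here before we took the lock. */
		exit_code = s->group_exit_code;
	} else {
		s->group_exit_code = exit_code;
		atomic_store(&s->group_exit, true);
	}
	model_unlock(s);
	return exit_code;
}
```

Whichever thread wins the race decides the exit code for the whole
group, which is why the state can be captured by a single flag.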

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit
  2021-06-24 19:03                                         ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman
@ 2021-06-26  5:56                                           ` Kees Cook
  0 siblings, 0 replies; 119+ messages in thread
From: Kees Cook @ 2021-06-26  5:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Tejun Heo

On Thu, Jun 24, 2021 at 02:03:25PM -0500, Eric W. Biederman wrote:
> 
> Reuse start_task_exit_locked to implement start_task_exit.
> 
> Simplify the exit logic by having all exits go through get_signal.
> This simplifies the analysis of synchronization during exit and
> guarantees a full set of registers will be available for ptrace to
> examine at PTRACE_EVENT_EXIT.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched/signal.h |  1 +
>  kernel/exit.c                |  4 +++-
>  kernel/seccomp.c             |  2 +-
>  kernel/signal.c              | 12 ++++++++++++
>  4 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index a958381ba4a9..3f4e69c019b7 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -430,6 +430,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
>  
>  void start_group_exit(int exit_code);
>  void start_task_exit_locked(struct task_struct *task, int exit_code);
> +void start_task_exit(int exit_code);
>  
>  void task_join_group_stop(struct task_struct *task);
>  
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 635f434122b7..51e0c82b3f7d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);
>  
>  SYSCALL_DEFINE1(exit, int, error_code)
>  {
> -	do_exit((error_code&0xff)<<8);
> +	start_task_exit((error_code&0xff)<<8);
> +	/* get_signal will call do_exit */
> +	return 0;
>  }
>  
>  
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b1c06fd1b205..e0c4c123a8bf 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1248,7 +1248,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			force_sig_seccomp(this_syscall, data, true);
>  		} else {
>  			if (action == SECCOMP_RET_KILL_THREAD)
> -				do_exit(SIGSYS);
> +				start_task_exit(SIGSYS);
>  			else
>  				start_group_exit(SIGSYS);
>  		}

Looks good, yeah.

> diff --git a/kernel/signal.c b/kernel/signal.c
> index afbc001220dd..63fda9b6bbf9 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1424,6 +1424,18 @@ void start_task_exit_locked(struct task_struct *task, int exit_code)
>  	}
>  }
>  
> +void start_task_exit(int exit_code)
> +{
> +	struct task_struct *task = current;
> +	if (!fatal_signal_pending(task)) {
> +		struct sighand_struct *const sighand = task->sighand;
> +		spin_lock_irq(&sighand->siglock);
> +		if (!fatal_signal_pending(current))

efficiency nit: "task" instead of "current" here, yes?

> +			start_task_exit_locked(task, exit_code);
> +		spin_unlock_irq(&sighand->siglock);
> +	}
> +}
> +
>  struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
>  					   unsigned long *flags)
>  {
> -- 
> 2.20.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-24 22:45                                         ` [PATCH 0/9] Refactoring exit Al Viro
@ 2021-06-27 22:13                                           ` Al Viro
  2021-06-27 22:59                                             ` Michael Schmitz
  2021-06-28 19:02                                           ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-27 22:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:

> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
> IMO it's long past the time to simply remove the sucker.

Incidentally, calling that from ptraced process on alpha leads to
the same headache for tracer.  _If_ we leave it around, this is
another candidate for "hit yourself with that special signal" -
both alpha and m68k have that syscall, and IMO adding an asm
wrapper for that one is over the top.

Said that, we really ought to bury that thing:

commit 2f268ee88abb33968501a44368db55c63adaad40
Author: Andrew Morton <akpm@digeo.com>
Date:   Sat Dec 14 03:16:29 2002 -0800

    [PATCH] deprecate use of bdflush()
	
    Patch from Robert Love <rml@tech9.net>
		
    We can never get rid of it if we do not deprecate it - so do so and
    print a stern warning to those who still run bdflush daemons.

Deprecated for 18.5 years by now - I seriously suspect that we have
some contributors younger than that...

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-27 22:13                                           ` Al Viro
@ 2021-06-27 22:59                                             ` Michael Schmitz
  2021-06-28  7:31                                               ` Geert Uytterhoeven
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-27 22:59 UTC (permalink / raw)
  To: Al Viro, Eric W. Biederman
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Tejun Heo, Kees Cook

On 28/06/21 10:13 am, Al Viro wrote:

> On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
>
>> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
>> IMO it's long past the time to simply remove the sucker.
> Incidentally, calling that from ptraced process on alpha leads to
> the same headache for tracer.  _If_ we leave it around, this is
> another candidate for "hit yourself with that special signal" -
> both alpha and m68k have that syscall, and IMO adding an asm
> wrapper for that one is over the top.
>
> Said that, we really ought to bury that thing:
>
> commit 2f268ee88abb33968501a44368db55c63adaad40
> Author: Andrew Morton <akpm@digeo.com>
> Date:   Sat Dec 14 03:16:29 2002 -0800
>
>      [PATCH] deprecate use of bdflush()
> 	
>      Patch from Robert Love <rml@tech9.net>
> 		
>      We can never get rid of it if we do not deprecate it - so do so and
>      print a stern warning to those who still run bdflush daemons.
>
> Deprecated for 18.5 years by now - I seriously suspect that we have
> some contributors younger than that...

Haven't found that warning in over 7 years' worth of console logs, and 
I'm a good candidate for running the oldest userland in existence for m68k.

Time to let it go.

Cheers,

     Michael



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-27 22:59                                             ` Michael Schmitz
@ 2021-06-28  7:31                                               ` Geert Uytterhoeven
  2021-06-28 16:20                                                 ` Eric W. Biederman
  2021-06-28 17:14                                                 ` Michael Schmitz
  0 siblings, 2 replies; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-28  7:31 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Hi Michael,

On Mon, Jun 28, 2021 at 1:00 AM Michael Schmitz <schmitzmic@gmail.com> wrote:
> On 28/06/21 10:13 am, Al Viro wrote:
> > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
> >
> >> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
> >> IMO it's long past the time to simply remove the sucker.
> > Incidentally, calling that from ptraced process on alpha leads to
> > the same headache for tracer.  _If_ we leave it around, this is
> > another candidate for "hit yourself with that special signal" -
> > both alpha and m68k have that syscall, and IMO adding an asm
> > wrapper for that one is over the top.
> >
> > Said that, we really ought to bury that thing:
> >
> > commit 2f268ee88abb33968501a44368db55c63adaad40
> > Author: Andrew Morton <akpm@digeo.com>
> > Date:   Sat Dec 14 03:16:29 2002 -0800
> >
> >      [PATCH] deprecate use of bdflush()
> >
> >      Patch from Robert Love <rml@tech9.net>
> >
> >      We can never get rid of it if we do not deprecate it - so do so and
> >      print a stern warning to those who still run bdflush daemons.
> >
> > Deprecated for 18.5 years by now - I seriously suspect that we have
> > some contributors younger than that...
>
> Haven't found that warning in over 7 years' worth of console logs, and
> I'm a good candidate for running the oldest userland in existence for m68k.
>
> Time to let it go.

The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
which is a very old ramdisk from right after the m68k a.out to ELF
transition:

    warning: process `update' used the obsolete bdflush system call
    Fix your initscripts?

I still boot it, once in a while.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-28  7:31                                               ` Geert Uytterhoeven
@ 2021-06-28 16:20                                                 ` Eric W. Biederman
  2021-06-28 17:14                                                 ` Michael Schmitz
  1 sibling, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-28 16:20 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Michael Schmitz, Al Viro, Linus Torvalds, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
	Tejun Heo, Kees Cook

Geert Uytterhoeven <geert@linux-m68k.org> writes:

> Hi Michael,
>
> On Mon, Jun 28, 2021 at 1:00 AM Michael Schmitz <schmitzmic@gmail.com> wrote:
>> On 28/06/21 10:13 am, Al Viro wrote:
>> > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
>> >
>> >> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
>> >> IMO it's long past the time to simply remove the sucker.
>> > Incidentally, calling that from ptraced process on alpha leads to
>> > the same headache for tracer.  _If_ we leave it around, this is
>> > another candidate for "hit yourself with that special signal" -
>> > both alpha and m68k have that syscall, and IMO adding an asm
>> > wrapper for that one is over the top.
>> >
>> > Said that, we really ought to bury that thing:
>> >
>> > commit 2f268ee88abb33968501a44368db55c63adaad40
>> > Author: Andrew Morton <akpm@digeo.com>
>> > Date:   Sat Dec 14 03:16:29 2002 -0800
>> >
>> >      [PATCH] deprecate use of bdflush()
>> >
>> >      Patch from Robert Love <rml@tech9.net>
>> >
>> >      We can never get rid of it if we do not deprecate it - so do so and
>> >      print a stern warning to those who still run bdflush daemons.
>> >
>> > Deprecated for 18.5 years by now - I seriously suspect that we have
>> > some contributors younger than that...
>>
>> Haven't found that warning in over 7 years' worth of console logs, and
>> I'm a good candidate for running the oldest userland in existence for m68k.
>>
>> Time to let it go.
>
> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> which is a very old ramdisk from right after the m68k a.out to ELF
> transition:
>
>     warning: process `update' used the obsolete bdflush system call
>     Fix your initscripts?
>
> I still boot it, once in a while.

The only thing left in bdflush is func == 1 calls do_exit(0);

Which is a hack introduced in 2.3.23 aka October of 1999 to force the
userspace process calling bdflush to exit, instead of repeatedly calling
sys_bdflush.

Can you try deleting that func == 1 call and seeing if your old ramdisk
works?

I suspect userspace used to get into a tight spin calling bdflush
func == 1, when that function stopped doing anything.  That was back in
1999 so we are probably safe without it.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-28  7:31                                               ` Geert Uytterhoeven
  2021-06-28 16:20                                                 ` Eric W. Biederman
@ 2021-06-28 17:14                                                 ` Michael Schmitz
  2021-06-28 19:17                                                   ` Geert Uytterhoeven
  1 sibling, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-28 17:14 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Hi Geert,

On 28.06.2021 at 19:31, Geert Uytterhoeven wrote:
>> Haven't found that warning in over 7 years' worth of console logs, and
>> I'm a good candidate for running the oldest userland in existence for m68k.
>>
>> Time to let it go.
>
> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> which is a very old ramdisk from right after the m68k a.out to ELF
> transition:
>
>     warning: process `update' used the obsolete bdflush system call
>     Fix your initscripts?
>
> I still boot it, once in a while.

OK; you take the cake. That ramdisk came to mind when I thought about 
where I'd last seen bdflush, but I've not used it in ages (not sure 14 
MB are enough for that).

The question then is - will bdflush fail gracefully, or spin retrying 
the syscall?

Cheers,

	Michael

>
> Gr{oetje,eeting}s,
>
>                         Geert
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-24 22:45                                         ` [PATCH 0/9] Refactoring exit Al Viro
  2021-06-27 22:13                                           ` Al Viro
@ 2021-06-28 19:02                                           ` Eric W. Biederman
  1 sibling, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-28 19:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Jun 24, 2021 at 01:57:35PM -0500, Eric W. Biederman wrote:
>
>> So far the code has been lightly tested, and the descriptions of some
>> of the patches are a bit light, but I think this shows the direction
>> I am aiming to travel for sorting out exit(2) and exit_group(2).
>
> FWIW, here's the current picture for do_exit(), aside of exit(2) and do_exit_group():
>
> 1) stuff that is clearly oops-like -
>         alpha:die_if_kernel() alpha:do_entUna() alpha:do_page_fault() arm:oops_end()
>         arm:__do_kernel_fault() arm64:die() arm64:die_kernel_fault() csky:alignment()
>         csky:die() csky:no_context() h8300:die() h8300:do_page_fault() hexagon:die()
>         ia64:die() ia64:ia64_do_page_fault() m68k:die_if_kernel() m68k:send_fault_sig()
>         microblaze:die() mips:die() nds32:handle_fpu_exception() nds32:die()
>         nds32:unhandled_interruption() nds32:unhandled_exceptions() nds32:do_revinsn()
>         nds32:do_page_fault() nios:die() openrisc:die() openrisc:do_page_fault()
>         parisc:die_if_kernel() ppc:oops_end() riscv:die() riscv:die_kernel_fault()
>         s390:die() s390:do_no_context() s390:do_low_address() sh:die()
>         sparc32:die_if_kernel() sparc32:do_sparc_fault() sparc64:die_if_kernel()
>         x86:rewind_stack_do_exit() xtensa:die() xtensa:bad_page_fault()
> We really do not want ptrace anywhere near any of those and we do not want
> any of that to return; this shit would better be handled right there and
> there - no "post a fatal signal" would do.

Thanks that makes a good start for digging into these.

I think the distinction I would make is:
- If the kernel is broken use do_task_dead.
- Otherwise cleanup the semantics by using start_group_exit,
  start_task_exit or by just cleaning up the code.
  

Looking at the reboot case it looks like the code
should have become do_group_exit in 2.5.  I have a suspicion
we have a bunch of similar cases that want to terminate the
entire process, but we simply never updated to deal with
multi-thread processes.

I suspect in the reboot case panic if machine_halt or
machine_power_off fails is more likely the correct
handling.  But we do have funny semantics sometimes.

I will see what I can do to expand my patchset to handle all of these
various callers of do_exit.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-28 17:14                                                 ` Michael Schmitz
@ 2021-06-28 19:17                                                   ` Geert Uytterhoeven
  2021-06-28 20:13                                                     ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-28 19:17 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Hi Michael,

On Mon, Jun 28, 2021 at 7:14 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
> On 28.06.2021 at 19:31, Geert Uytterhoeven wrote:
> >> Haven't found that warning in over 7 years' worth of console logs, and
> >> I'm a good candidate for running the oldest userland in existence for m68k.
> >>
> >> Time to let it go.
> >
> > The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> > which is a very old ramdisk from right after the m68k a.out to ELF
> > transition:
> >
> >     warning: process `update' used the obsolete bdflush system call
> >     Fix your initscripts?
> >
> > I still boot it, once in a while.
>
> OK; you take the cake. That ramdisk came to mind when I thought about
> where I'd last seen bdflush, but I've not used it in ages (not sure 14
> MB are enough for that).

Of course it will work on your 14 MiB machine!  It fits on a floppy, _after_
decompression.  It was used by people to install Linux on the hard disks
of their beefy m68k machines, after they had set up the family Christmas
tree, in December 1996.

I also have a slightly larger one, built from OpenWRT when I did my first
experiments on that.  Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
a shell on the serial console, so it is more useful to me.

> The question then is - will bdflush fail gracefully, or spin retrying
> the syscall?

Will add to my todo list...
BTW, you can boot this ramdisk on ARAnyM, too.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation
  2021-06-26  3:17                                           ` Kees Cook
@ 2021-06-28 19:21                                             ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-28 19:21 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo

Kees Cook <keescook@chromium.org> writes:

> On Thu, Jun 24, 2021 at 01:59:55PM -0500, Eric W. Biederman wrote:
>> 
>> Factor out force_sig_seccomp from the seccomp signal generation and
>> place it in kernel/signal.c.  The function force_sig_seccomp takes a
>> parameter force_coredump to indicate that the sigaction field should be
>> reset to SIG_DFL so that a coredump will be generated when the signal
>> is delivered.
>
> Ah! This is the part I missed when I was originally trying to figure
> out the coredump stuff. It's the need for setting a default handler
> (i.e. doing a coredump)?

Yes.  If we don't force the handler to SIG_DFL someone might catch
SIGSYS.
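
To illustrate why the handler has to be forced back to SIG_DFL, here is
a minimal user-space sketch showing that SIGSYS is an ordinary catchable
signal.  It is not kernel code and not part of the patch; the names
on_sigsys and sigsys_can_be_caught are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler; sig_atomic_t is safe to store from a handler. */
static volatile sig_atomic_t sigsys_caught;

/* Hypothetical handler: only records that SIGSYS arrived. */
static void on_sigsys(int signo)
{
	if (signo == SIGSYS)
		sigsys_caught = 1;
}

/*
 * Install a SIGSYS handler, raise the signal at ourselves and report
 * whether the handler ran instead of the default fatal action.
 * Returns 1 if the signal was caught, 0 if not, -1 on setup failure.
 */
static int sigsys_can_be_caught(void)
{
	struct sigaction sa, old;
	int rc;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigsys;
	sigemptyset(&sa.sa_mask);

	sigsys_caught = 0;
	if (sigaction(SIGSYS, &sa, &old) != 0)
		return -1;

	rc = raise(SIGSYS);             /* delivered before raise() returns */
	sigaction(SIGSYS, &old, NULL);  /* put the previous action back */
	if (rc != 0)
		return -1;

	return sigsys_caught ? 1 : 0;
}
```

A process that installs such a handler survives the signal, so a
seccomp kill that is supposed to dump core cannot rely on whatever
action userspace has set up.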

>> force_sig_seccomp is then used to replace both seccomp_send_sigsys
>> and seccomp_init_siginfo.
>> 
>> force_sig_info_to_task gains an extra parameter to force using
>> the default signal action.
>> 
>> With this change seccomp is no longer a special case and there
>> becomes exactly one place do_coredump is called from.
>
> Looks good to me. This may benefit from force_sig_seccomp() to be wrapped
> in an #ifdef CONFIG_SECCOMP.

At which point Linus will probably be grumpy with me for introducing
#ifdefs.

I suspect seccomp at this point is sufficiently common that it is probably
more productive to figure out how to remove #ifdef CONFIG_SECCOMP.

> (This patch reminds me that the seccomp self tests don't check for core
> dumps...)

This patch is slightly wrong in that it kept the call to do_group_exit
when it can never be reached.

Eric



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads
  2021-06-26  3:42                                           ` Kees Cook
@ 2021-06-28 19:25                                             ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-28 19:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven,
	linux-m68k, Arnd Bergmann, Tejun Heo

Kees Cook <keescook@chromium.org> writes:

> On Thu, Jun 24, 2021 at 02:02:16PM -0500, Eric W. Biederman wrote:
>> 
>> Fold do_group_exit into get_signal as it is the last caller.
>> 
>> Move the group_exit logic above the PF_IO_WORKER exit, ensuring
>> that if a PF_IO_WORKER catches SIGKILL every thread in
>> the thread group will exit not just the PF_IO_WORKER.
>> 
>> Now that the information is easily available only set PF_SIGNALED
>> when it was a signal that caused the exit.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  include/linux/sched/task.h |  1 -
>>  kernel/exit.c              | 31 -------------------------------
>>  kernel/signal.c            | 35 +++++++++++++++++++++++++----------
>>  3 files changed, 25 insertions(+), 42 deletions(-)
>> 
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index ef02be869cf2..45525512e3d0 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk)
>>  {
>>  }
>>  #endif
>> -extern void do_group_exit(int);
>>  
>>  extern void exit_files(struct task_struct *);
>>  extern void exit_itimers(struct signal_struct *);
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 921519d80b56..635f434122b7 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code)
>>  	do_exit((error_code&0xff)<<8);
>>  }
>>  
>> -/*
>> - * Take down every thread in the group.  This is called by fatal signals
>> - * as well as by sys_exit_group (below).
>> - */
>> -void
>> -do_group_exit(int exit_code)
>> -{
>> -	struct signal_struct *sig = current->signal;
>> -
>> -	BUG_ON(exit_code & 0x80); /* core dumps don't get here */
>> -
>> -	if (signal_group_exit(sig))
>> -		exit_code = sig->group_exit_code;
>> -	else if (!thread_group_empty(current)) {
>> -		struct sighand_struct *const sighand = current->sighand;
>> -
>> -		spin_lock_irq(&sighand->siglock);
>> -		if (signal_group_exit(sig))
>> -			/* Another thread got here before we took the lock.  */
>> -			exit_code = sig->group_exit_code;
>> -		else {
>> -			sig->group_exit_code = exit_code;
>> -			sig->flags = SIGNAL_GROUP_EXIT;
>> -			zap_other_threads(current);
>
> Oh, now I see it: the "new code" in start_group_exit() is an open-coded
> zap_other_threads()? That wasn't clear to me, but makes sense now.

Pretty much.  I think zap_other_threads has actually muddied the waters
quite a bit by putting reuse in the wrong place.

>> -		}
>> -		spin_unlock_irq(&sighand->siglock);
>> -	}
>> -
>> -	do_exit(exit_code);
>> -	/* NOTREACHED */
>> -}
>>  
>>  /*
>>   * this kills every thread in the thread group. Note that any externally
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index c79c010ca5f3..95a076af600a 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig)
>>  {
>>  	struct sighand_struct *sighand = current->sighand;
>>  	struct signal_struct *signal = current->signal;
>> +	int exit_code;
>>  	int signr;
>>  
>>  	if (unlikely(current->task_works))
>> @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig)
>>  		/*
>>  		 * Anything else is fatal, maybe with a core dump.
>>  		 */
>> -		current->flags |= PF_SIGNALED;
>> -
>>  		if (sig_kernel_coredump(signr)) {
>>  			if (print_fatal_signals)
>>  				print_fatal_signal(ksig->info.si_signo);
>> @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig)
>>  			/*
>>  			 * If it was able to dump core, this kills all
>>  			 * other threads in the group and synchronizes with
>> -			 * their demise.  If we lost the race with another
>> -			 * thread getting here, it set group_exit_code
>> -			 * first and our do_group_exit call below will use
>> -			 * that value and ignore the one we pass it.
>> +			 * their demise.  If  another thread makes it
>> +			 * to do_coredump first, it will set group_exit_code
>> +			 * which will be passed to do_exit.
>>  			 */
>>  			do_coredump(&ksig->info);
>>  		}
>>  
>> +		/*
>> +		 * Death signals, no core dump.
>> +		 */
>> +		exit_code = signr;
>> +		if (signal_group_exit(signal)) {
>> +			exit_code = signal->group_exit_code;
>> +		} else {
>> +			spin_lock_irq(&sighand->siglock);
>> +			if (signal_group_exit(signal)) {
>> +				/* Another thread got here before we took the lock.  */
>> +				exit_code = signal->group_exit_code;
>> +			} else {
>> +				start_group_exit_locked(signal, exit_code);
>
> And here's the "if we didn't already do start_group_exit(), do it here".
> And that state is entirely captured via the SIGNAL_GROUP_EXIT flag.
> Cool.

Yes.  At least when the dust clears. 
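
A toy, single-threaded model of the "first exit code wins" rule in the
hunk above may help.  It is only a sketch: struct signal_model,
MODEL_GROUP_EXIT and model_pick_exit_code are invented for the example,
and the siglock and the recheck under it are reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GROUP_EXIT 0x1	/* stand-in for SIGNAL_GROUP_EXIT */

/* Hypothetical cut-down version of the per-process signal state. */
struct signal_model {
	unsigned int flags;
	int group_exit_code;
};

/* Has some thread already started a group exit? */
static bool model_group_exit_started(const struct signal_model *sig)
{
	return (sig->flags & MODEL_GROUP_EXIT) != 0;
}

/*
 * Return the exit code this thread should die with.  If a group exit
 * is already in progress the recorded code wins and the proposed one
 * is ignored; otherwise the proposed code is recorded for everybody.
 */
static int model_pick_exit_code(struct signal_model *sig, int proposed)
{
	if (model_group_exit_started(sig))
		return sig->group_exit_code;

	/* The kernel takes siglock here and checks the flag again. */
	sig->group_exit_code = proposed;
	sig->flags = MODEL_GROUP_EXIT;
	return proposed;
}
```

Whichever caller gets there first fixes the code for the whole group,
which is the state the SIGNAL_GROUP_EXIT flag carries.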

>> +			}
>> +			spin_unlock_irq(&sighand->siglock);
>> +		}
>> +
>> +		if (exit_code & 0x7f)
>> +			current->flags |= PF_SIGNALED;
>> +
>>  		/*
>>  		 * PF_IO_WORKER threads will catch and exit on fatal signals
>>  		 * themselves. They have cleanup that must be performed, so
>> @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>>  		if (current->flags & PF_IO_WORKER)
>>  			goto out;
>>  
>> -		/*
>> -		 * Death signals, no core dump.
>> -		 */
>> -		do_group_exit(ksig->info.si_signo);
>> +		do_exit(exit_code);
>>  		/* NOTREACHED */
>>  	}
>>  	spin_unlock_irq(&sighand->siglock);
>> -- 
>> 2.20.1
>> 
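
The exit_code & 0x7f test in the quoted patch depends on how exit codes
are laid out.  The sketch below spells that layout out as I understand
it: signal number in the low 7 bits, 0x80 for a core dump, exit(2)
status in bits 8..15.  The helper names are hypothetical and exist only
for this example.

```c
#include <assert.h>

/*
 * Code passed to do_exit() by exit(2), as in the quoted kernel/exit.c
 * hunk: the low byte of the user value moved up to bits 8..15.
 */
static int status_from_exit(int error_code)
{
	return (error_code & 0xff) << 8;
}

/*
 * Code for a fatal signal: the signal number in bits 0..6, with 0x80
 * set when a core dump was written.
 */
static int status_from_signal(int signr, int dumped_core)
{
	assert(signr > 0 && signr < 0x80);
	return signr | (dumped_core ? 0x80 : 0);
}

/* The test the patch applies before setting PF_SIGNALED. */
static int killed_by_signal(int exit_code)
{
	return (exit_code & 0x7f) != 0;
}
```

A code produced by exit(2) always has the low 7 bits clear, so under
this layout PF_SIGNALED is only set when a signal caused the exit.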

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-28 19:17                                                   ` Geert Uytterhoeven
@ 2021-06-28 20:13                                                     ` Michael Schmitz
  2021-06-28 21:18                                                       ` Geert Uytterhoeven
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-28 20:13 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Hi Geert,

On 29/06/21 7:17 am, Geert Uytterhoeven wrote:
>>> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
>>> which is a very old ramdisk from right after the m68k a.out to ELF
>>> transition:
>>>
>>>      warning: process `update' used the obsolete bdflush system call
>>>      Fix your initscripts?
>>>
>>> I still boot it, once in a while.
>> OK; you take the cake. That ramdisk came to mind when I thought about
>> where I'd last seen bdflush, but I've not used it in ages (not sure 14
>> MB are enough for that).
> Of course it will work on your 14 MiB machine!  It fits on a floppy, _after_
> decompression.  It was used by people to install Linux on the hard disks
> of their beefy m68k machines, after they had set up the family Christmas
> tree, in December 1996.
Been there, done that. Wrote the HOWTO for ext2 filesystem byte-swapping.
> I also have a slightly larger one, built from OpenWRT when I did my first
> experiments on that.  Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
> a shell on the serial console, so it is more useful to me.
>
>> The question then is - will bdflush fail gracefully, or spin retrying
>> the syscall?
> Will add to my todo list...
> BTW, you can boot this ramdisk on ARAnyM, too.

True. I can't find that ramdisk image anywhere - if you can point me to 
some archive, I'll give that a try.

Cheers,

     Michael


>
> Gr{oetje,eeting}s,
>
>                          Geert
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-28 20:13                                                     ` Michael Schmitz
@ 2021-06-28 21:18                                                       ` Geert Uytterhoeven
  2021-06-28 23:42                                                         ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-28 21:18 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Hi Michael,

On Mon, Jun 28, 2021 at 10:13 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
> On 29/06/21 7:17 am, Geert Uytterhoeven wrote:
> >>> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> >>> which is a very old ramdisk from right after the m68k a.out to ELF
> >>> transition:
> >>>
> >>>      warning: process `update' used the obsolete bdflush system call
> >>>      Fix your initscripts?
> >>>
> >>> I still boot it, once in a while.
> >> OK; you take the cake. That ramdisk came to mind when I thought about
> >> where I'd last seen bdflush, but I've not used it in ages (not sure 14
> >> MB are enough for that).
> > Of course it will work on your 14 MiB machine!  It fits on a floppy, _after_
> > decompression.  It was used by people to install Linux on the hard disks
> > of their beefy m68k machines, after they had set up the family Christmas
> > tree, in December 1996.
>
> Been there, done that. Wrote the HOWTO for ext2 filesystem byte-swapping.

I knew I could revive your memory ;-)

> > I also have a slightly larger one, built from OpenWRT when I did my first
> > experiments on that.  Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
> > a shell on the serial console, so it is more useful to me.
> >
> >> The question then is - will bdflush fail gracefully, or spin retrying
> >> the syscall?
> > Will add to my todo list...
> > BTW, you can boot this ramdisk on ARAnyM, too.
>
> True. I can't find that ramdisk image anywhere - if you can point me to
> some archive, I'll give that a try.

http://ftp.mac.linux-m68k.org/pub/linux-mac68k/initrd/

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH 0/9] Refactoring exit
  2021-06-28 21:18                                                       ` Geert Uytterhoeven
@ 2021-06-28 23:42                                                         ` Michael Schmitz
  2021-06-29 20:28                                                           ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-28 23:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Hi Geert,

On 29/06/21 9:18 am, Geert Uytterhoeven wrote:
>
>>>> The question then is - will bdflush fail gracefully, or spin retrying
>>>> the syscall?
>>> Will add to my todo list...
>>> BTW, you can boot this ramdisk on ARAnyM, too.
>> True. I can't find that ramdisk image anywhere - if you can point me to
>> some archive, I'll give that a try.
> http://ftp.mac.linux-m68k.org/pub/linux-mac68k/initrd/

Thanks - removing the if (func==1) do_exit(0); part does give similar 
behaviour as before - kernel warns five times, then shuts up (without 
change, warns twice only, and /sbin/update no longer runs).

Removing the syscall from the m68k syscall table altogether still gives 
a working ramdisk. /sbin/update is still running, so evidently doesn't 
care about the invalid syscall result ...
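
For what it's worth, the first result can be reproduced with a small
user-space model of the syscall.  This is a sketch under assumptions,
not the kernel code: the limit of five warnings is taken from the
behaviour reported above, it models a single caller only, and the names
bdflush_model, model_bdflush and model_run_daemon are made up.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_WARN_LIMIT 5	/* assumed cap on printed warnings */

/* Hypothetical state: warnings so far, and whether the caller is gone. */
struct bdflush_model {
	int msg_count;
	bool caller_exited;
};

/*
 * Model one bdflush(func, ...) call.  keep_exit_hack selects whether
 * func == 1 still takes the do_exit(0) path.
 */
static int model_bdflush(struct bdflush_model *m, int func, bool keep_exit_hack)
{
	if (m->msg_count < MODEL_WARN_LIMIT)
		m->msg_count++;		/* the kernel would printk here */

	if (keep_exit_hack && func == 1)
		m->caller_exited = true;
	return 0;
}

/*
 * Drive the model like a daemon that keeps calling bdflush(1, ...)
 * until it is killed.  Returns the number of calls actually made.
 */
static int model_run_daemon(struct bdflush_model *m, int calls, bool keep_exit_hack)
{
	int made = 0;

	while (made < calls && !m->caller_exited) {
		model_bdflush(m, 1, keep_exit_hack);
		made++;
	}
	return made;
}
```

With the hack removed the model keeps being called and the warning
count stops at five; with the hack in place a single caller is gone
after its first call.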

Cheers,

     Michael


>
> Gr{oetje,eeting}s,
>
>                          Geert
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-28 23:42                                                         ` Michael Schmitz
@ 2021-06-29 20:28                                                           ` Eric W. Biederman
  2021-06-29 21:45                                                             ` Michael Schmitz
                                                                               ` (3 more replies)
  0 siblings, 4 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-29 20:28 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Geert Uytterhoeven, Al Viro, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, linux-api


The bdflush system call has been deprecated for a very long time.
Recently Michael Schmitz tested[1] and found that the last known
caller of the bdflush system call is unaffected by its removal.

Since the code is not needed delete it.

[1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

I think we have consensus that bdflush can be removed. Can folks please
verify I have removed it correctly?

Michael could you give me a Tested-by on this patch?

 arch/alpha/kernel/syscalls/syscall.tbl        |  2 +-
 arch/arm/tools/syscall.tbl                    |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +-
 arch/ia64/kernel/syscalls/syscall.tbl         |  2 +-
 arch/m68k/kernel/syscalls/syscall.tbl         |  2 +-
 arch/microblaze/kernel/syscalls/syscall.tbl   |  2 +-
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  2 +-
 arch/parisc/kernel/syscalls/syscall.tbl       |  2 +-
 arch/powerpc/kernel/syscalls/syscall.tbl      |  2 +-
 arch/s390/kernel/syscalls/syscall.tbl         |  2 +-
 arch/sh/kernel/syscalls/syscall.tbl           |  2 +-
 arch/sparc/kernel/syscalls/syscall.tbl        |  2 +-
 arch/x86/entry/syscalls/syscall_32.tbl        |  2 +-
 arch/xtensa/kernel/syscalls/syscall.tbl       |  2 +-
 fs/buffer.c                                   | 27 -------------------
 include/linux/syscalls.h                      |  1 -
 include/uapi/linux/capability.h               |  1 -
 kernel/sys_ni.c                               |  1 -
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  2 +-
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  2 +-
 20 files changed, 16 insertions(+), 46 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..85d2bcd9cf36 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -230,7 +230,7 @@
 259	common	osf_swapctl			sys_ni_syscall
 260	common	osf_memcntl			sys_ni_syscall
 261	common	osf_fdatasync			sys_ni_syscall
-300	common	bdflush				sys_bdflush
+300	common	bdflush				sys_ni_syscall
 301	common	sethae				sys_sethae
 302	common	mount				sys_mount
 303	common	old_adjtimex			sys_old_adjtimex
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 28e03b5fec00..241988512648 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -147,7 +147,7 @@
 131	common	quotactl		sys_quotactl
 132	common	getpgid			sys_getpgid
 133	common	fchdir			sys_fchdir
-134	common	bdflush			sys_bdflush
+134	common	bdflush			sys_ni_syscall
 135	common	sysfs			sys_sysfs
 136	common	personality		sys_personality
 # 137 was sys_afs_syscall
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 5dab69d2c22b..a35cd6c4909c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -279,7 +279,7 @@ __SYSCALL(__NR_getpgid, sys_getpgid)
 #define __NR_fchdir 133
 __SYSCALL(__NR_fchdir, sys_fchdir)
 #define __NR_bdflush 134
-__SYSCALL(__NR_bdflush, sys_bdflush)
+__SYSCALL(__NR_bdflush, sys_ni_syscall)
 #define __NR_sysfs 135
 __SYSCALL(__NR_sysfs, sys_sysfs)
 #define __NR_personality 136
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index bb11fe4c875a..7de53a9a2972 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -123,7 +123,7 @@
 # 1135 was get_kernel_syms
 # 1136 was query_module
 113	common	quotactl			sys_quotactl
-114	common	bdflush				sys_bdflush
+114	common	bdflush				sys_ni_syscall
 115	common	sysfs				sys_sysfs
 116	common	personality			sys_personality
 117	common	afs_syscall			sys_ni_syscall
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 79c2d24c89dd..be5abd9c8c07 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131	common	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+134	common	bdflush				sys_ni_syscall
 135	common	sysfs				sys_sysfs
 136	common	personality			sys_personality
 # 137 was afs_syscall
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b11395a20c20..555fd987f4ab 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131	common	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+134	common	bdflush				sys_ni_syscall
 135	common	sysfs				sys_sysfs
 136	common	personality			sys_personality
 137	common	afs_syscall			sys_ni_syscall
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index d560c467a8c6..2c6b10db3bd5 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -145,7 +145,7 @@
 131	o32	quotactl			sys_quotactl
 132	o32	getpgid				sys_getpgid
 133	o32	fchdir				sys_fchdir
-134	o32	bdflush				sys_bdflush
+134	o32	bdflush				sys_ni_syscall
 135	o32	sysfs				sys_sysfs
 136	o32	personality			sys_personality			sys_32_personality
 137	o32	afs_syscall			sys_ni_syscall
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index aabc37f8cae3..51c156cb00f1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -147,7 +147,7 @@
 131	common	quotactl		sys_quotactl
 132	common	getpgid			sys_getpgid
 133	common	fchdir			sys_fchdir
-134	common	bdflush			sys_bdflush
+134	common	bdflush			sys_ni_syscall
 135	common	sysfs			sys_sysfs
 136	32	personality		parisc_personality
 136	64	personality		sys_personality
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 8f052ff4058c..2518e4e6dccf 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
 131	nospu	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+134	common	bdflush				sys_ni_syscall
 135	common	sysfs				sys_sysfs
 136	32	personality			sys_personality			ppc64_personality
 136	64	personality			ppc64_personality
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 0690263df1dd..ffcf03714f12 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
 131  common	quotactl		sys_quotactl			sys_quotactl
 132  common	getpgid			sys_getpgid			sys_getpgid
 133  common	fchdir			sys_fchdir			sys_fchdir
-134  common	bdflush			sys_bdflush			sys_bdflush
+134  common	bdflush			sys_ni_syscall			sys_ni_syscall
 135  common	sysfs			sys_sysfs			sys_sysfs
 136  common	personality		sys_s390_personality		sys_s390_personality
 137  common	afs_syscall		-				-
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 0b91499ebdcf..6e7305066a70 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131	common	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+134	common	bdflush				sys_ni_syscall
 135	common	sysfs				sys_sysfs
 136	common	personality			sys_personality
 # 137 was afs_syscall
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e34cc30ef22c..bf330dda7c61 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -270,7 +270,7 @@
 222	common	delete_module		sys_delete_module
 223	common	get_kernel_syms		sys_ni_syscall
 224	common	getpgid			sys_getpgid
-225	common	bdflush			sys_bdflush
+225	common	bdflush			sys_ni_syscall
 226	common	sysfs			sys_sysfs
 227	common	afs_syscall		sys_nis_syscall
 228	common	setfsuid		sys_setfsuid16
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4bbc267fb36b..a21a72763d58 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -145,7 +145,7 @@
 131	i386	quotactl		sys_quotactl
 132	i386	getpgid			sys_getpgid
 133	i386	fchdir			sys_fchdir
-134	i386	bdflush			sys_bdflush
+134	i386	bdflush			sys_ni_syscall
 135	i386	sysfs			sys_sysfs
 136	i386	personality		sys_personality
 137	i386	afs_syscall
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index fd2f30227d96..db4e3d09b249 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -223,7 +223,7 @@
 # 205 was old nfsservctl
 205	common	nfsservctl			sys_ni_syscall
 206	common	_sysctl				sys_ni_syscall
-207	common	bdflush				sys_bdflush
+207	common	bdflush				sys_ni_syscall
 208	common	uname				sys_newuname
 209	common	sysinfo				sys_sysinfo
 210	common	init_module			sys_init_module
diff --git a/fs/buffer.c b/fs/buffer.c
index ea48c01fb76b..04ddff76c860 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3292,33 +3292,6 @@ int try_to_free_buffers(struct page *page)
 }
 EXPORT_SYMBOL(try_to_free_buffers);
 
-/*
- * There are no bdflush tunables left.  But distributions are
- * still running obsolete flush daemons, so we terminate them here.
- *
- * Use of bdflush() is deprecated and will be removed in a future kernel.
- * The `flush-X' kernel threads fully replace bdflush daemons and this call.
- */
-SYSCALL_DEFINE2(bdflush, int, func, long, data)
-{
-	static int msg_count;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	if (msg_count < 5) {
-		msg_count++;
-		printk(KERN_INFO
-			"warning: process `%s' used the obsolete bdflush"
-			" system call\n", current->comm);
-		printk(KERN_INFO "Fix your initscripts?\n");
-	}
-
-	if (func == 1)
-		do_exit(0);
-	return 0;
-}
-
 /*
  * Buffer-head allocation
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..1bd6e05ea116 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1157,7 +1157,6 @@ asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
 asmlinkage long sys_vfork(void);
 asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_send(int, void __user *, size_t, unsigned);
-asmlinkage long sys_bdflush(int func, long data);
 asmlinkage long sys_oldumount(char __user *name);
 asmlinkage long sys_uselib(const char __user *library);
 asmlinkage long sys_sysfs(int option,
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 2ddb4226cd23..463d1ba2232a 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -243,7 +243,6 @@ struct vfs_ns_cap_data {
 /* Allow examination and configuration of disk quotas */
 /* Allow setting the domainname */
 /* Allow setting the hostname */
-/* Allow calling bdflush() */
 /* Allow mount() and umount(), setting up new smb connection */
 /* Allow some autofs root ioctls */
 /* Allow nfsservctl */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..adf4d66ffae2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -414,7 +414,6 @@ COND_SYSCALL(epoll_wait);
 COND_SYSCALL(recv);
 COND_SYSCALL_COMPAT(recv);
 COND_SYSCALL(send);
-COND_SYSCALL(bdflush);
 COND_SYSCALL(uselib);
 
 /* optional: time32 */
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 2e68fbb57cc6..ab72dec9dadb 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
 131	nospu	quotactl			sys_quotactl
 132	common	getpgid				sys_getpgid
 133	common	fchdir				sys_fchdir
-134	common	bdflush				sys_bdflush
+134	common	bdflush				sys_ni_syscall
 135	common	sysfs				sys_sysfs
 136	32	personality			sys_personality			ppc64_personality
 136	64	personality			ppc64_personality
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 7e4a2aba366d..f2eba775e676 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
 131  common	quotactl		sys_quotactl			sys_quotactl
 132  common	getpgid			sys_getpgid			sys_getpgid
 133  common	fchdir			sys_fchdir			sys_fchdir
-134  common	bdflush			sys_bdflush			sys_bdflush
+134  common	bdflush			-				-
 135  common	sysfs			sys_sysfs			sys_sysfs
 136  common	personality		sys_s390_personality		sys_s390_personality
 137  common	afs_syscall		-				-
-- 
2.20.1



* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28                                                           ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
@ 2021-06-29 21:45                                                             ` Michael Schmitz
  2021-06-30  8:24                                                             ` Geert Uytterhoeven
                                                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-29 21:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Geert Uytterhoeven, Al Viro, Linus Torvalds, linux-arch,
	Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, linux-api


On 30/06/21 8:28 am, Eric W. Biederman wrote:
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed, delete it.
>
> [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Michael Schmitz <schmitzmic@gmail.com>
> ---
>
> I think we have consensus that bdflush can be removed. Can folks please
> verify I have removed it correctly?
>
> Michael could you give me a Tested-by on this patch?


* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28                                                           ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  2021-06-29 21:45                                                             ` Michael Schmitz
@ 2021-06-30  8:24                                                             ` Geert Uytterhoeven
  2021-06-30  8:37                                                             ` Arnd Bergmann
  2021-06-30 12:30                                                             ` Cyril Hrubis
  3 siblings, 0 replies; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-30  8:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Al Viro, Linus Torvalds, linux-arch, Jens Axboe,
	Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
	Tejun Heo, Kees Cook, Linux API

On Tue, Jun 29, 2021 at 10:28 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed, delete it.
>
> [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

>  arch/m68k/kernel/syscalls/syscall.tbl         |  2 +-

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28                                                           ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  2021-06-29 21:45                                                             ` Michael Schmitz
  2021-06-30  8:24                                                             ` Geert Uytterhoeven
@ 2021-06-30  8:37                                                             ` Arnd Bergmann
  2021-06-30 12:30                                                             ` Cyril Hrubis
  3 siblings, 0 replies; 119+ messages in thread
From: Arnd Bergmann @ 2021-06-30  8:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Geert Uytterhoeven, Al Viro, Linus Torvalds,
	linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Tejun Heo, Kees Cook, Linux API

On Tue, Jun 29, 2021 at 10:28 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
>
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed, delete it.
>
> [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> I think we have consensus that bdflush can be removed. Can folks please
> verify I have removed it correctly?

Reviewed-by: Arnd Bergmann <arnd@arndb.de>

We are traditionally somewhat inconsistent about whether to leave the
__NR_bdflush macro present in asm/unistd.h or to remove it when the
syscall is gone. Leaving it in place as you do is probably better here.
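
To make the effect concrete: with the number still defined and the table
entry pointing at sys_ni_syscall, a caller gets -ENOSYS instead of reaching
the old handler. Below is a minimal userspace model of that dispatch, not
kernel code. All model_* and MODEL_* names are made up for illustration,
and model_getpgid only echoes its argument so the two kinds of entry can
be told apart.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical two-argument handler type; real handlers differ per arch. */
typedef long (*syscall_fn)(long a0, long a1);

/* Stand-in for sys_ni_syscall: the "not implemented" handler. */
static long model_ni_syscall(long a0, long a1)
{
	(void)a0;
	(void)a1;
	return -ENOSYS;
}

/* Stand-in for a call that is still implemented; echoes its argument. */
static long model_getpgid(long pid, long unused)
{
	(void)unused;
	return pid;
}

/* Numbers follow the m68k/sh tables quoted in the patch. */
enum {
	MODEL_NR_getpgid = 132,
	MODEL_NR_bdflush = 134,
	MODEL_NR_LIMIT   = 137,
};

/* The number 134 stays allocated; only its handler changes. */
static const syscall_fn model_table[MODEL_NR_LIMIT] = {
	[MODEL_NR_getpgid] = model_getpgid,
	[MODEL_NR_bdflush] = model_ni_syscall,
};

/* Out-of-range numbers and empty slots also report -ENOSYS in this model. */
long model_dispatch(long nr, long a0, long a1)
{
	if (nr < 0 || nr >= MODEL_NR_LIMIT || model_table[nr] == NULL)
		return -ENOSYS;
	return model_table[nr](a0, a1);
}

/* Small self-check of the two cases discussed above. */
void model_selftest(void)
{
	assert(model_dispatch(MODEL_NR_bdflush, 1, 0) == -ENOSYS);
	assert(model_dispatch(MODEL_NR_getpgid, 7, 0) == 7);
}
```

In this model, old binaries that were built against the number still
compile and run; they just see the call fail with ENOSYS, the same result
as for any other unimplemented syscall.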

        Arnd


* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28                                                           ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
                                                                               ` (2 preceding siblings ...)
  2021-06-30  8:37                                                             ` Arnd Bergmann
@ 2021-06-30 12:30                                                             ` Cyril Hrubis
  3 siblings, 0 replies; 119+ messages in thread
From: Cyril Hrubis @ 2021-06-30 12:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Geert Uytterhoeven, Al Viro, Linus Torvalds,
	linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha,
	linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, linux-api

Hi!
I sent a similar patch [1] a while ago when I removed the bdflush tests from
LTP.

[1] https://lore.kernel.org/lkml/20190528101012.11402-1-chrubis@suse.cz/

Acked-by: Cyril Hrubis <chrubis@suse.cz>

-- 
Cyril Hrubis
chrubis@suse.cz


Thread overview: 119+ messages
2021-06-10 20:57 Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Eric W. Biederman
2021-06-10 22:04 ` Linus Torvalds
2021-06-11 21:39   ` Eric W. Biederman
2021-06-11 23:26     ` Linus Torvalds
2021-06-13 21:54       ` Eric W. Biederman
2021-06-13 22:18         ` Linus Torvalds
2021-06-14  2:05           ` Michael Schmitz
2021-06-14  5:03             ` Michael Schmitz
2021-06-14 16:26               ` Eric W. Biederman
2021-06-14 22:26                 ` Michael Schmitz
2021-06-15 19:30                   ` Eric W. Biederman
2021-06-15 19:36                     ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman
2021-06-15 22:02                       ` Linus Torvalds
2021-06-16 16:32                         ` Eric W. Biederman
2021-06-16 18:29                           ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman
2021-06-16 18:31                             ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
2021-06-16 20:00                               ` Linus Torvalds
2021-06-16 20:37                                 ` Linus Torvalds
2021-06-16 20:57                                   ` Eric W. Biederman
2021-06-16 21:02                                     ` Al Viro
2021-06-16 21:08                                     ` Linus Torvalds
2021-06-16 20:42                                 ` Eric W. Biederman
2021-06-16 20:17                               ` Al Viro
2021-06-21  2:01                               ` Michael Schmitz
2021-06-21  2:17                                 ` Linus Torvalds
2021-06-21  3:18                                   ` Michael Schmitz
2021-06-21  3:37                                     ` Linus Torvalds
2021-06-21  4:08                                       ` Michael Schmitz
2021-06-21  3:44                                     ` Al Viro
2021-06-21  5:31                                       ` Michael Schmitz
2021-06-21  2:27                                 ` Al Viro
2021-06-21  3:36                                   ` Michael Schmitz
2021-06-16 18:32                             ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman
2021-06-16 20:25                               ` Al Viro
2021-06-16 20:28                                 ` Al Viro
2021-06-16 20:49                                   ` Eric W. Biederman
2021-06-16 20:54                                     ` Al Viro
2021-06-16 20:47                                 ` Eric W. Biederman
2021-06-16 20:55                                   ` Al Viro
2021-06-16 20:50                       ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Al Viro
2021-06-15 20:56                     ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz
2021-06-16  0:23                       ` Finn Thain
2021-06-15 21:58                     ` Linus Torvalds
2021-06-16 15:06                       ` Eric W. Biederman
2021-06-21 13:54                       ` Al Viro
2021-06-21 14:16                         ` Al Viro
2021-06-21 16:50                           ` Eric W. Biederman
2021-06-21 23:05                             ` Al Viro
2021-06-22 16:39                               ` Eric W. Biederman
2021-06-21 15:38                         ` Linus Torvalds
2021-06-21 18:59                         ` Al Viro
2021-06-21 19:22                           ` Linus Torvalds
2021-06-21 19:45                             ` Al Viro
2021-06-21 23:14                               ` Linus Torvalds
2021-06-21 23:23                                 ` Al Viro
2021-06-21 23:36                                   ` Linus Torvalds
2021-06-22 21:02                                     ` Eric W. Biederman
2021-06-22 21:48                                       ` Michael Schmitz
2021-06-23  5:26                                         ` Michael Schmitz
2021-06-23 14:36                                           ` Eric W. Biederman
2021-06-22  0:01                                 ` Michael Schmitz
2021-06-22 20:04                                 ` Michael Schmitz
2021-06-22 20:18                                   ` Al Viro
2021-06-22 21:57                                     ` Michael Schmitz
2021-06-21 20:03                             ` Eric W. Biederman
2021-06-21 23:15                               ` Linus Torvalds
2021-06-22 20:52                                 ` Eric W. Biederman
2021-06-23  0:41                                   ` Linus Torvalds
2021-06-23 14:33                                     ` Eric W. Biederman
2021-06-24 18:57                                       ` [PATCH 0/9] Refactoring exit Eric W. Biederman
2021-06-24 18:59                                         ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman
2021-06-24 18:59                                         ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman
2021-06-26  3:17                                           ` Kees Cook
2021-06-28 19:21                                             ` Eric W. Biederman
2021-06-24 19:00                                         ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman
2021-06-26  3:20                                           ` Kees Cook
2021-06-24 19:01                                         ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman
2021-06-24 20:04                                           ` Linus Torvalds
2021-06-26  3:24                                           ` Kees Cook
2021-06-24 19:01                                         ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman
2021-06-26  3:35                                           ` Kees Cook
2021-06-24 19:02                                         ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman
2021-06-26  3:42                                           ` Kees Cook
2021-06-28 19:25                                             ` Eric W. Biederman
2021-06-24 19:02                                         ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman
2021-06-24 20:11                                           ` Linus Torvalds
2021-06-24 21:37                                             ` Eric W. Biederman
2021-06-24 19:03                                         ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman
2021-06-26  5:56                                           ` Kees Cook
2021-06-24 19:03                                         ` [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal Eric W. Biederman
2021-06-24 22:45                                         ` [PATCH 0/9] Refactoring exit Al Viro
2021-06-27 22:13                                           ` Al Viro
2021-06-27 22:59                                             ` Michael Schmitz
2021-06-28  7:31                                               ` Geert Uytterhoeven
2021-06-28 16:20                                                 ` Eric W. Biederman
2021-06-28 17:14                                                 ` Michael Schmitz
2021-06-28 19:17                                                   ` Geert Uytterhoeven
2021-06-28 20:13                                                     ` Michael Schmitz
2021-06-28 21:18                                                       ` Geert Uytterhoeven
2021-06-28 23:42                                                         ` Michael Schmitz
2021-06-29 20:28                                                           ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
2021-06-29 21:45                                                             ` Michael Schmitz
2021-06-30  8:24                                                             ` Geert Uytterhoeven
2021-06-30  8:37                                                             ` Arnd Bergmann
2021-06-30 12:30                                                             ` Cyril Hrubis
2021-06-28 19:02                                           ` [PATCH 0/9] Refactoring exit Eric W. Biederman
2021-06-21 19:24                           ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro
2021-06-21 23:24                             ` Michael Schmitz
2021-06-16  7:38                     ` Geert Uytterhoeven
2021-06-16 19:40                       ` Michael Schmitz
2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz
2021-06-13 19:59   ` Linus Torvalds
2021-06-13 20:07     ` Michael Schmitz
2021-06-13 20:26       ` Linus Torvalds
2021-06-13 20:33         ` Linus Torvalds
2021-06-13 20:47         ` Linus Torvalds
2021-06-14  7:13   ` Michael Schmitz
2021-06-14  7:40     ` Andreas Schwab
2021-06-14  8:19       ` Michael Schmitz
