* Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads
@ 2021-06-10 20:57 Eric W. Biederman
  2021-06-10 22:04 ` Linus Torvalds
  2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz
  0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-10 20:57 UTC (permalink / raw)
  To: linux-arch
  Cc: Jens Axboe, Oleg Nesterov, Al Viro, Linus Torvalds, linux-kernel,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, linux-alpha,
	Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan,
	Tejun Heo, Daniel Jacobowitz, Kees Cook

Folks,

Digging through the guts of exit I found something I am not quite
certain what to do with.

On some architectures such as alpha, m68k, and nios2 the kernel calls
into system calls with a subset of the registers saved on the kernel
stack, and the kernel calls into signal handling and a few other
contexts with all of the registers saved on the kernel stack.

The problem is sometimes we read all of the registers from
a context where they are not all saved.

When this was initially observed it looked just like a coredump problem
and it could be solved by tweaking the coredump code.  That change was
77f6ab8b7768 ("don't dump the threads that had been already exiting when zapped.")

However I have looked further and we have the location where get_signal
is called from io_uring, and we have the ptrace_stop in
PTRACE_EVENT_EXIT.

In PTRACE_EVENT_EXIT we could be called from exit(2) which is a syscall
and we definitely won't have everything saved on the kernel stack.

I have not double-checked create_io_thread but I don't think
create_io_thread saves all of the registers on the kernel stack.

I think at this point we need to say that the architectures that
do this need to be fixed to at least call do_exit and the kernel
function in create_io_thread with the deeper stack.

Is that reasonable of me to ask?
Is there some other way to deal with this issue that I am not seeing? Am I missing some critical detail that makes PTRACE_EVENT_EXIT in do_exit not a problem if someone reads the register with ptrace? Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
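A minimal userspace C model of the problem described in the message above, offered as a sketch only. It is not kernel code: the struct layouts, the register names, and every model_* helper are hypothetical, and real alpha, m68k and nios2 frames differ. It only shows that a reader which assumes the full register set was saved sees leftover stack contents when the entry path stored just the subset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts for illustration; real per-arch frames differ. */
struct model_pt_regs      { uint64_t r0, r1; };   /* saved on every syscall entry */
struct model_switch_stack { uint64_t r9, r10; };  /* saved only on selected paths */

/* Top of a modelled kernel stack: the extra frame sits just below pt_regs. */
struct model_kstack_top {
	struct model_switch_stack sw;
	struct model_pt_regs regs;
};

/* User register values held in the CPU at syscall entry. */
struct model_cpu { uint64_t r0, r1, r9, r10; };

#define MODEL_STALE 0xdeadbeefdeadbeefULL  /* stands in for old kernel stack data */

/* Fast path: only the pt_regs subset is written to the stack. */
void model_syscall_entry_fast(struct model_kstack_top *top,
			      const struct model_cpu *cpu)
{
	top->regs.r0 = cpu->r0;
	top->regs.r1 = cpu->r1;
}

/* Full path: additionally save the remaining registers. */
void model_syscall_entry_full(struct model_kstack_top *top,
			      const struct model_cpu *cpu)
{
	model_syscall_entry_fast(top, cpu);
	top->sw.r9 = cpu->r9;
	top->sw.r10 = cpu->r10;
}

/* What a ptrace-style reader sees when it assumes everything was saved. */
uint64_t model_tracer_read_r9(const struct model_kstack_top *top)
{
	return top->sw.r9;
}

/* Run one entry path over a stack pre-filled with stale data. */
uint64_t model_demo_read_r9(int full_save)
{
	struct model_kstack_top top;
	struct model_cpu cpu = { .r0 = 1, .r1 = 2, .r9 = 42, .r10 = 43 };

	assert(full_save == 0 || full_save == 1);

	top.sw.r9 = top.sw.r10 = MODEL_STALE;
	top.regs.r0 = top.regs.r1 = MODEL_STALE;

	if (full_save)
		model_syscall_entry_full(&top, &cpu);
	else
		model_syscall_entry_fast(&top, &cpu);

	return model_tracer_read_r9(&top);
}
```

In this model, model_demo_read_r9(0) hands back the stale pattern rather than the user's r9, while model_demo_read_r9(1) returns the value the user actually had in the register.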
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-10 20:57 Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Eric W. Biederman @ 2021-06-10 22:04 ` Linus Torvalds 2021-06-11 21:39 ` Eric W. Biederman 2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz 1 sibling, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-10 22:04 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz, Kees Cook On Thu, Jun 10, 2021 at 1:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > The problem is sometimes we read all of the registers from > a context where they are not all saved. Ouch. Yes. And this is really painful because none of the *normal* architectures do this, so it gets absolutely no coverage. > I think at this point we need to say that the architectures that have a > do this need to be fixed to at least call do_exit and the kernel > function in create_io_thread with the deeper stack. Yeah. We traditionally have that requirement for fork() and friends too (vfork/clone), so adding exit and io_uring to do so seems like the most straightforward thing. But I really wish we had some way to test and trigger this so that we wouldn't get caught on this before. Something in task_pt_regs() that catches "this doesn't actually work" and does a WARN_ON_ONCE() on the affected architectures? Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-10 22:04 ` Linus Torvalds @ 2021-06-11 21:39 ` Eric W. Biederman 2021-06-11 23:26 ` Linus Torvalds 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-11 21:39 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Jun 10, 2021 at 1:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> The problem is sometimes we read all of the registers from >> a context where they are not all saved. > > Ouch. Yes. And this is really painful because none of the *normal* > architectures do this, so it gets absolutely no coverage. > >> I think at this point we need to say that the architectures that have a >> do this need to be fixed to at least call do_exit and the kernel >> function in create_io_thread with the deeper stack. > > Yeah. We traditionally have that requirement for fork() and friends > too (vfork/clone), so adding exit and io_uring to do so seems like the > most straightforward thing. Interesting. I am starting with Al's analysis and reading the code to see if I can understand what is going on. So I am still glossing over a few details as I dig into this. Kernel threads not having all of their registers saved is one of those details. Looking at copy_thread it looks like at least on alpha we are dealing with a structure that defines all of the registers in copy_thread. So perhaps all of the registers are there in kernel_threads already. I don't read alpha assembly very well and fork is a bit subtle. I don't know which piece of code is calling ret_from_fork/ret_from_kernel_thread. 
I really suspect that all of those registers are popped so at least for
IO_THREADS we need to push them again, in a way that signal_pt_regs()
can find them.

It looks like we just need something like this to cover the userspace
side of exit.

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..ab0dcb545bd1 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -812,6 +812,22 @@ fork_like fork
 fork_like vfork
 fork_like clone
 
+.macro	exit_like name
+	.align	4
+	.globl	alpha_\name
+	.ent	alpha_\name
+alpha_\name:
+	.prologue 0
+	DO_SWITCH_STACK
+	jsr	$26, sys_\name
+	UNDO_SWITCH_STACK
+	ret
+.end	alpha_\name
+.endm
+
+exit_like exit
+exit_like exit_group
+
 .macro	sigreturn_like name
 	.align	4
 	.globl	sys_\name
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..b9d6449d6caa 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	osf_syscall	alpha_syscall_zero
-1	common	exit	sys_exit
+1	common	exit	alpha_exit
 2	common	fork	alpha_fork
 3	common	read	sys_read
 4	common	write	sys_write
@@ -333,7 +333,7 @@
 400	common	io_getevents	sys_io_getevents
 401	common	io_submit	sys_io_submit
 402	common	io_cancel	sys_io_cancel
-405	common	exit_group	sys_exit_group
+405	common	exit_group	alpha_exit_group
 406	common	lookup_dcookie	sys_lookup_dcookie
 407	common	epoll_create	sys_epoll_create
 408	common	epoll_ctl	sys_epoll_ctl

> But I really wish we had some way to test and trigger this so that we
> wouldn't get caught on this before. Something in task_pt_regs() that
> catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
> affected architectures?

I think that would require pushing an extra magic value in SWITCH_STACK
and not just popping it but deliberately changing that value in
UNDO_SWITCH_STACK.  Basically stack canaries.

I don't see how we could do it in an arch independent way though.
Which means it will require auditing all of the architectures to get
there.

Volunteers?

This is looking straightforward enough that I can probably pull
something together, just don't count on me to have it done in anything
resembling a timely manner.

Eric
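The canary idea discussed above can be sketched in plain C. This is a userspace model under stated assumptions, not something from the posted patches: the magic and poison values, the extra magic field, and all model_* names are hypothetical. The save step arms the canary, the restore step deliberately changes it, and an accessor in the spirit of task_pt_regs() checks it before trusting the frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical canary values; nothing like this exists in the posted patches. */
#define MODEL_SW_MAGIC  0x53574954434853ULL
#define MODEL_SW_POISON 0x0ULL

struct model_switch_stack {
	uint64_t r9;     /* one representative saved register */
	uint64_t magic;  /* canary written on save, destroyed on restore */
};

/* Models DO_SWITCH_STACK: save the register and arm the canary. */
void model_do_switch_stack(struct model_switch_stack *sw, uint64_t r9)
{
	sw->r9 = r9;
	sw->magic = MODEL_SW_MAGIC;
}

/* Models UNDO_SWITCH_STACK: restore, then deliberately change the canary. */
uint64_t model_undo_switch_stack(struct model_switch_stack *sw)
{
	assert(sw->magic == MODEL_SW_MAGIC);  /* popping an unsaved frame is a bug */
	sw->magic = MODEL_SW_POISON;
	return sw->r9;
}

/* The check a register accessor could make before trusting the frame. */
bool model_switch_stack_valid(const struct model_switch_stack *sw)
{
	return sw->magic == MODEL_SW_MAGIC;
}
```

A real implementation would have to do the equivalent in each architecture's entry assembly, which is why the thread concludes that there is no arch-independent way to get this check.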
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-11 21:39 ` Eric W. Biederman @ 2021-06-11 23:26 ` Linus Torvalds 2021-06-13 21:54 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-11 23:26 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz, Kees Cook On Fri, Jun 11, 2021 at 2:40 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Looking at copy_thread it looks like at least on alpha we are dealing > with a structure that defines all of the registers in copy_thread. On the target side, yes. On the _source_ side, the code does struct pt_regs *regs = current_pt_regs(); and that's the part that means that fork() and related functions need to have done that DO_SWITCH_STACK(), so that they have the full register set to be copied. Otherwise it would copy random contents from the source stack. But that if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) { ends up protecting us, and the code never uses that set of source registers for the io worker threads. So io_uring looks fine on alpha. I didn't check m68k and friends, but I think they have the same thing going. > It looks like we just need something like this to cover the userspace > side of exit. Looks correct to me. Except I think you could just use "fork_like()" instead of creating a new (and identical) "exit_like()" macro. > > But I really wish we had some way to test and trigger this so that we > > wouldn't get caught on this before. Something in task_pt_regs() that > > catches "this doesn't actually work" and does a WARN_ON_ONCE() on the > > affected architectures? 
>
> I think that would require pushing an extra magic value in SWITCH_STACK
> and not just popping it but deliberately changing that value in
> UNDO_SWITCH_STACK.  Basically stack canaries.
>
> I don't see how we could do it in an arch independent way though.

No, I think you're right. There's no obvious generic solution to it,
and once we look at arch-specific ones we're back to "just alpha,
m68k and nios need this or care" and once you're there you might as
well just fix it.

ia64 has some "fast system call" model with limited registers too, but
I think that's limited to just a few very special system calls (ie it
does the reverse of what alpha does: alpha does the fast case by
default, and then marks fork/vfork/clone as special).

                 Linus
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-11 23:26 ` Linus Torvalds @ 2021-06-13 21:54 ` Eric W. Biederman 2021-06-13 22:18 ` Linus Torvalds 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-13 21:54 UTC (permalink / raw) To: Linus Torvalds Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Fri, Jun 11, 2021 at 2:40 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> Looking at copy_thread it looks like at least on alpha we are dealing >> with a structure that defines all of the registers in copy_thread. > > On the target side, yes. > > On the _source_ side, the code does > > struct pt_regs *regs = current_pt_regs(); > > and that's the part that means that fork() and related functions need > to have done that DO_SWITCH_STACK(), so that they have the full > register set to be copied. > > Otherwise it would copy random contents from the source stack. > > But that > > if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) { > > ends up protecting us, and the code never uses that set of source > registers for the io worker threads. The test in copy_thread. That isn't the case I am worried about. > So io_uring looks fine on alpha. I didn't check m68k and friends, but > I think they have the same thing going. As I have read through the code more I don't think so. 
The code paths I am worried about are:

  ret_from_kernel_thread
    io_wqe_worker
      get_signal
        do_coredump
        ptrace_stop

  ret_from_kernel_thread
    io_sq_thread
      get_signal
        do_coredump
        ptrace_stop

As I understand the code the new thread created by create_thread
initially has a full complement of registers, and then is started by
alpha_switch_to:

	.align	4
	.globl	alpha_switch_to
	.type	alpha_switch_to, @function
	.cfi_startproc
alpha_switch_to:
	DO_SWITCH_STACK
	call_pal PAL_swpctx
	lda	$8, 0x3fff
	UNDO_SWITCH_STACK
	bic	$sp, $8, $8
	mov	$17, $0
	ret
	.cfi_endproc
	.size	alpha_switch_to, .-alpha_switch_to

The alpha_switch_to will remove the extra registers from the stack and
then call ret which if I understand alpha assembly correctly is
equivalent to jumping to where $26 points.  Which is
ret_from_kernel_thread (as set up by copy_thread).

Which leaves ret_from_kernel_thread and everything it calls without
the extra context saved on the stack.

I am still trying to understand how we get registers populated at a
fixed offset on the stack during schedule.  As it looks like switch_to
assumes the stack pointer is in the proper location.

>> It looks like we just need something like this to cover the userspace
>> side of exit.
>
> Looks correct to me. Except I think you could just use "fork_like()"
> instead of creating a new (and identical) "exit_like()" macro.
>
>> > But I really wish we had some way to test and trigger this so that we
>> > wouldn't get caught on this before. Something in task_pt_regs() that
>> > catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
>> > affected architectures?
>>
>> I think that would require pushing an extra magic value in SWITCH_STACK
>> and not just popping it but deliberately changing that value in
>> UNDO_SWITCH_STACK.  Basically stack canaries.
>>
>> I don't see how we could do it in an arch independent way though.
>
> No, I think you're right. There's no obvious generic solution to it,
> and once we look at arch-specific ones we're back to "just alpha,
> m68k and nios need this or care" and once you're there you might as
> well just fix it.
>
> ia64 has some "fast system call" model with limited registers too, but
> I think that's limited to just a few very special system calls (ie it
> does the reverse of what alpha does: alpha does the fast case by
> default, and then marks fork/vfork/clone as special).

I wonder if the arch specific solution should be to move the registers
to a fixed location in task_struct (perhaps thread_struct) so that the
same patterns can apply across all architectures and we don't get
surprises at all.

What appears to be unique about alpha, m68k, and nios is that space is
not always reserved for all of the registers, so we can't always count
on them being saved after a task switch.

Eric
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-13 21:54 ` Eric W. Biederman @ 2021-06-13 22:18 ` Linus Torvalds 2021-06-14 2:05 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-13 22:18 UTC (permalink / raw) To: Eric W. Biederman Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz, Kees Cook [-- Attachment #1: Type: text/plain, Size: 1127 bytes --] On Sun, Jun 13, 2021 at 2:55 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > The alpha_switch_to will remove the extra registers from the stack and > then call ret which if I understand alpha assembly correctly is > equivalent to jumping to where $26 points. Which is > ret_from_kernel_thread (as setup by copy_thread). > > Which leaves ret_from_kernel_thread and everything it calls without > the extra context saved on the stack. Uhhuh. Right you are, I think. It's been ages since I worked on that code and my alpha handbook is somewhere else, but yes, when alpha_switch_to() has context-switched to the new PCB state, it will then pop those registers in the new context and return. So we do set up the right stack frame for the worker thread, but as you point out, it then gets used up immediately when running. So by the time the IO worker thread calls get_signal(), it's no longer useful. How very annoying. The (obviously UNTESTED) patch might be something like the attached. I wouldn't be surprised if m68k has the exact same thing for the exact same reason, but I didn't check.. 
                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 982 bytes --]

 arch/alpha/kernel/process.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..edbfe03f4b2c 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -251,8 +251,17 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
+		/*
+		 * Give it *two* switch stacks, one for the kernel
+		 * state return that is used up by alpha_switch_to,
+		 * and one for the "user state" which is accessed
+		 * by ptrace.
+		 */
+		childstack--;
+		childti->pcb.ksp = (unsigned long) childstack;
+
 		memset(childstack, 0,
-			sizeof(struct switch_stack) + sizeof(struct pt_regs));
+			2*sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		childstack->r26 = (unsigned long) ret_from_kernel_thread;
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;
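The stack arithmetic behind the "two switch stacks" patch above can be checked with a small userspace sketch. Everything here is an assumption made for illustration: the struct sizes are placeholders, the 512-byte stack and pt_regs sitting at its very top are simplifications, and the model_* names are hypothetical. It only shows that after the first context switch pops one frame, the stack pointer lands on the second frame, which stays in place directly below pt_regs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder sizes; the real structs are defined by the architecture. */
struct model_pt_regs      { uint64_t r[8]; };
struct model_switch_stack { uint64_t r[4]; };

#define MODEL_KSTACK_SIZE 512u  /* hypothetical kernel stack size in bytes */

struct model_kthread_layout {
	size_t regs_off;       /* pt_regs, modelled at the top of the stack */
	size_t user_sw_off;    /* frame left in place for ptrace to read */
	size_t kernel_sw_off;  /* frame consumed by the first context switch */
	size_t zeroed_bytes;   /* length of the memset() in the patch */
};

/* Offsets as the patched copy_thread() would lay out a kernel thread. */
struct model_kthread_layout model_copy_thread_layout(void)
{
	struct model_kthread_layout l;

	assert(MODEL_KSTACK_SIZE >= sizeof(struct model_pt_regs) +
				    2 * sizeof(struct model_switch_stack));

	l.regs_off = MODEL_KSTACK_SIZE - sizeof(struct model_pt_regs);
	l.user_sw_off = l.regs_off - sizeof(struct model_switch_stack);
	/* the added "childstack--" moves down by one more frame */
	l.kernel_sw_off = l.user_sw_off - sizeof(struct model_switch_stack);
	l.zeroed_bytes = 2 * sizeof(struct model_switch_stack) +
			 sizeof(struct model_pt_regs);
	return l;
}

/* Models UNDO_SWITCH_STACK in the switch path: sp moves up past one frame. */
size_t model_sp_after_first_switch(size_t ksp)
{
	return ksp + sizeof(struct model_switch_stack);
}
```

With only one frame, the same pop would leave the stack pointer at pt_regs, and whatever the thread then calls would overwrite the area a later reader expects to hold the extra registers.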
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-13 22:18 ` Linus Torvalds @ 2021-06-14 2:05 ` Michael Schmitz 2021-06-14 5:03 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-14 2:05 UTC (permalink / raw) To: Linus Torvalds, Eric W. Biederman Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Daniel Jacobowitz, Kees Cook Hi Linus, On 14/06/21 10:18 am, Linus Torvalds wrote: > On Sun, Jun 13, 2021 at 2:55 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> The alpha_switch_to will remove the extra registers from the stack and >> then call ret which if I understand alpha assembly correctly is >> equivalent to jumping to where $26 points. Which is >> ret_from_kernel_thread (as setup by copy_thread). >> >> Which leaves ret_from_kernel_thread and everything it calls without >> the extra context saved on the stack. > Uhhuh. Right you are, I think. It's been ages since I worked on that > code and my alpha handbook is somewhere else, but yes, when > alpha_switch_to() has context-switched to the new PCB state, it will > then pop those registers in the new context and return. > > So we do set up the right stack frame for the worker thread, but as > you point out, it then gets used up immediately when running. So by > the time the IO worker thread calls get_signal(), it's no longer > useful. > > How very annoying. > > The (obviously UNTESTED) patch might be something like the attached. > > I wouldn't be surprised if m68k has the exact same thing for the exact > same reason, but I didn't check.. 
m68k is indeed similar, it has:

        if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
                /* kernel thread */
                memset(frame, 0, sizeof(struct fork_frame));
                frame->regs.sr = PS_S;
                frame->sw.a3 = usp; /* function */
                frame->sw.d7 = arg;
                frame->sw.retpc = (unsigned long)ret_from_kernel_thread;
                p->thread.usp = 0;
                return 0;
        }

so a similar patch should be possible.

Cheers,

     Michael

>
>                   Linus
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-14 2:05 ` Michael Schmitz @ 2021-06-14 5:03 ` Michael Schmitz 2021-06-14 16:26 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-14 5:03 UTC (permalink / raw) To: Linus Torvalds, Eric W. Biederman Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On second thought, I'm not certain what adding another empty stack frame would achieve here. On m68k, 'frame' already is a new stack frame, for running the new thread in. This new frame does not have any user context at all, and it's explicitly wiped anyway. Unless we save all user context on the stack, then push that context to a new save frame, and somehow point get_signal to look there for IO threads (essentially what Eric suggested), I don't see how this could work? I must be missing something. Cheers, Michael Schmitz Am 14.06.2021 um 14:05 schrieb Michael Schmitz: >> >> I wouldn't be surprised if m68k has the exact same thing for the exact >> same reason, but I didn't check.. > > m68k is indeed similar, it has: > > if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) { > /* kernel thread */ > memset(frame, 0, sizeof(struct fork_frame)); > frame->regs.sr = PS_S; > frame->sw.a3 = usp; /* function */ > frame->sw.d7 = arg; > frame->sw.retpc = (unsigned long)ret_from_kernel_thread; > p->thread.usp = 0; > return 0; > } > > so a similar patch should be possible. > > Cheers, > > Michael > > > >> >> Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-14 5:03 ` Michael Schmitz @ 2021-06-14 16:26 ` Eric W. Biederman 2021-06-14 22:26 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-14 16:26 UTC (permalink / raw) To: Michael Schmitz Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Michael Schmitz <schmitzmic@gmail.com> writes: > On second thought, I'm not certain what adding another empty stack frame would > achieve here. > > On m68k, 'frame' already is a new stack frame, for running the new thread > in. This new frame does not have any user context at all, and it's explicitly > wiped anyway. > > Unless we save all user context on the stack, then push that context to a new > save frame, and somehow point get_signal to look there for IO threads > (essentially what Eric suggested), I don't see how this could work? > > I must be missing something. It is only designed to work well enough so that ptrace will access something well defined when ptrace accesses io_uring tasks. The io_uring tasks are special in that they are user process threads that never run in userspace. So as long as everything ptrace can read is accessible on that process all is well. Having stared a bit longer at the code I think the short term fix for both of PTRACE_EVENT_EXIT and io_uring is to guard them both with CONFIG_HAVE_ARCH_TRACEHOOK. Today CONFIG_HAVE_ARCH_TRACEHOOK guards access to /proc/self/syscall. Which out of necessity ensures that user context is always readable. Which seems to solve both the PTRACE_EVENT_EXIT and the io_uring problems. What I especially like about that is there are a lot of other reasons to encourage architectures in a CONFIG_HAVE_ARCH_TRACEHOOK direction. 
I think the biggies are getting architectures to store the extra
saved state on context switch into some place in task_struct
and to implement the regset view of registers.

Hmm. This is odd.  CONFIG_HAVE_ARCH_TRACEHOOK is supposed to imply
CORE_DUMP_USE_REGSET.  But alpha, csky, h8300, m68k, microblaze, nds32
don't implement CORE_DUMP_USE_REGSET but nds32 implements
CONFIG_HAVE_ARCH_TRACEHOOK.

I will keep digging and see what clean code I can come up with.

Eric
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-14 16:26 ` Eric W. Biederman @ 2021-06-14 22:26 ` Michael Schmitz 2021-06-15 19:30 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-14 22:26 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Hi Eric, On 15/06/21 4:26 am, Eric W. Biederman wrote: > Michael Schmitz <schmitzmic@gmail.com> writes: > >> On second thought, I'm not certain what adding another empty stack frame would >> achieve here. >> >> On m68k, 'frame' already is a new stack frame, for running the new thread >> in. This new frame does not have any user context at all, and it's explicitly >> wiped anyway. >> >> Unless we save all user context on the stack, then push that context to a new >> save frame, and somehow point get_signal to look there for IO threads >> (essentially what Eric suggested), I don't see how this could work? >> >> I must be missing something. > It is only designed to work well enough so that ptrace will access > something well defined when ptrace accesses io_uring tasks. > > The io_uring tasks are special in that they are user process > threads that never run in userspace. So as long as everything > ptrace can read is accessible on that process all is well. OK, I'm testing a patch that would save extra context in sys_io_uring_setup, which ought to ensure that for m68k. > Having stared a bit longer at the code I think the short term > fix for both of PTRACE_EVENT_EXIT and io_uring is to guard > them both with CONFIG_HAVE_ARCH_TRACEHOOK. Fair enough :-) Cheers, Michael > > Today CONFIG_HAVE_ARCH_TRACEHOOK guards access to /proc/self/syscall. > Which out of necessity ensures that user context is always readable. 
> Which seems to solve both the PTRACE_EVENT_EXIT and the io_uring
> problems.
>
> What I especially like about that is there are a lot of other reasons
> to encourage architectures in a CONFIG_HAVE_ARCH_TRACEHOOK direction.
> I think the biggies are getting architectures to store the extra
> saved state on context switch into some place in task_struct
> and to implement the regset view of registers.
>
> Hmm. This is odd.  CONFIG_HAVE_ARCH_TRACEHOOK is supposed to imply
> CORE_DUMP_USE_REGSET.  But alpha, csky, h8300, m68k, microblaze, nds32
> don't implement CORE_DUMP_USE_REGSET but nds32 implements
> CONFIG_HAVE_ARCH_TRACEHOOK.
>
> I will keep digging and see what clean code I can come up with.
>
> Eric
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-14 22:26 ` Michael Schmitz @ 2021-06-15 19:30 ` Eric W. Biederman 2021-06-15 19:36 ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman ` (3 more replies) 0 siblings, 4 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-15 19:30 UTC (permalink / raw) To: Michael Schmitz Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Michael Schmitz <schmitzmic@gmail.com> writes: > Hi Eric, > > On 15/06/21 4:26 am, Eric W. Biederman wrote: >> Michael Schmitz <schmitzmic@gmail.com> writes: >> >>> On second thought, I'm not certain what adding another empty stack frame would >>> achieve here. >>> >>> On m68k, 'frame' already is a new stack frame, for running the new thread >>> in. This new frame does not have any user context at all, and it's explicitly >>> wiped anyway. >>> >>> Unless we save all user context on the stack, then push that context to a new >>> save frame, and somehow point get_signal to look there for IO threads >>> (essentially what Eric suggested), I don't see how this could work? >>> >>> I must be missing something. >> It is only designed to work well enough so that ptrace will access >> something well defined when ptrace accesses io_uring tasks. >> >> The io_uring tasks are special in that they are user process >> threads that never run in userspace. So as long as everything >> ptrace can read is accessible on that process all is well. > OK, I'm testing a patch that would save extra context in sys_io_uring_setup, > which ought to ensure that for m68k. I had to update ret_from_kernel_thread to pop that state to get Linus's change to boot. Apparently kernel_threads exiting needs to be handled. 
>> Having stared a bit longer at the code I think the short term >> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard >> them both with CONFIG_HAVE_ARCH_TRACEHOOK. Which does not work because nios2 which looks susceptible sets CONFIG_HAVE_ARCH_TRACEHOOK. A further look shows that there is also PTRACE_EVENT_EXEC that needs to be handled so execve and execveat need to be wrapped as well. Do you happen to know if there is userspace that will run in qemu-system-m68k that can be used for testing? Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads
  2021-06-15 19:30 ` Eric W. Biederman
@ 2021-06-15 19:36 ` Eric W. Biederman
  2021-06-15 22:02   ` Linus Torvalds
  2021-06-16 20:50   ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Al Viro
  2021-06-15 20:56 ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-15 19:36 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro,
	Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, alpha, Geert Uytterhoeven, linux-m68k,
	Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

While thinking about the information leaks fixed in 77f6ab8b7768
("don't dump the threads that had been already exiting when zapped.")
I realized the problem is much more general than just coredumps and
exit_mm.  We have io_uring threads, PTRACE_EVENT_EXEC and
PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
registers, but on some architectures has not saved them.

The function alpha_switch_to does something reasonable: it saves the
floating point registers and the caller saved registers and switches
to a different thread.  Any register the caller is not expected to
save it does not save.

Meanwhile the system call entry point on alpha also does something
reasonable.  The system call entry point saves all but the caller
saved integer registers and doesn't touch the floating point registers
as the kernel code does not touch them.

This is a nice happy fast path until the kernel wants to access the
user space's registers through ptrace or similar.  As user space's
caller saved registers may be saved at an unpredictable point in the
kernel code's stack, the routine which may stop and make the userspace
registers available must be wrapped by code that will first save a
switch stack frame at the bottom of the call stack, call the code that
may access those registers and then pop the switch stack frame.

The practical problem with this code structure is that this results in
a game of whack-a-mole wrapping different kernel system calls.  Losing
the game of whack-a-mole results in a security hole where userspace can
write arbitrary data to the kernel stack.

I looked and there is nothing I can do that is not arch specific, so
whack the moles with a minimal backportable fix.

This change survives boot testing on qemu-system-alpha.

Cc: stable@vger.kernel.org
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/alpha/kernel/entry.S              | 21 +++++++++++++++++++++
 arch/alpha/kernel/process.c            | 11 ++++++++++-
 arch/alpha/kernel/syscalls/syscall.tbl |  8 ++++----
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..98bb5b805089 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -785,6 +785,7 @@ ret_from_kernel_thread:
 	mov	$9, $27
 	mov	$10, $16
 	jsr	$26, ($9)
+	lda	$sp, SWITCH_STACK_SIZE($sp)
 	br	$31, ret_to_user
 .end ret_from_kernel_thread
 
@@ -811,6 +812,26 @@ alpha_\name:
 fork_like fork
 fork_like vfork
 fork_like clone
+fork_like exit
+fork_like exit_group
+
+.macro	exec_like name
+	.align	4
+	.globl	alpha_\name
+	.ent	alpha_\name
+	.cfi_startproc
+alpha_\name:
+	.prologue 0
+	DO_SWITCH_STACK
+	jsr	$26, sys_\name
+	UNDO_SWITCH_STACK
+	ret
+	.cfi_endproc
+.end	alpha_\name
+.endm
+
+exec_like execve
+exec_like execveat
 
 .macro	sigreturn_like name
 	.align	4
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..edbfe03f4b2c 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -251,8 +251,17 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
+		/*
+		 * Give it *two* switch stacks, one for the kernel
+		 * state return that is used up by alpha_switch_to,
+		 * and one for the "user state" which is accessed
+		 * by ptrace.
+		 */
+		childstack--;
+		childti->pcb.ksp = (unsigned long) childstack;
+
 		memset(childstack, 0,
-			sizeof(struct switch_stack) + sizeof(struct pt_regs));
+			2*sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		childstack->r26 = (unsigned long) ret_from_kernel_thread;
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..5f85f3c11ed4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	osf_syscall	alpha_syscall_zero
-1	common	exit	sys_exit
+1	common	exit	alpha_exit
 2	common	fork	alpha_fork
 3	common	read	sys_read
 4	common	write	sys_write
@@ -65,7 +65,7 @@
 56	common	osf_revoke	sys_ni_syscall
 57	common	symlink	sys_symlink
 58	common	readlink	sys_readlink
-59	common	execve	sys_execve
+59	common	execve	alpha_execve
 60	common	umask	sys_umask
 61	common	chroot	sys_chroot
 62	common	osf_old_fstat	sys_ni_syscall
@@ -333,7 +333,7 @@
 400	common	io_getevents	sys_io_getevents
 401	common	io_submit	sys_io_submit
 402	common	io_cancel	sys_io_cancel
-405	common	exit_group	sys_exit_group
+405	common	exit_group	alpha_exit_group
 406	common	lookup_dcookie	sys_lookup_dcookie
 407	common	epoll_create	sys_epoll_create
 408	common	epoll_ctl	sys_epoll_ctl
@@ -441,7 +441,7 @@
 510	common	renameat2	sys_renameat2
 511	common	getrandom	sys_getrandom
 512	common	memfd_create	sys_memfd_create
-513	common	execveat	sys_execveat
+513	common	execveat	alpha_execveat
 514	common	seccomp	sys_seccomp
 515	common	bpf	sys_bpf
 516	common	userfaultfd	sys_userfaultfd
-- 
2.20.1
* Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads 2021-06-15 19:36 ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman @ 2021-06-15 22:02 ` Linus Torvalds 2021-06-16 16:32 ` Eric W. Biederman 2021-06-16 20:50 ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Al Viro 1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-15 22:02 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Tue, Jun 15, 2021 at 12:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I looked and there nothing I can do that is not arch specific, so
> whack the moles with a minimal backportable fix.
>
> This change survives boot testing on qemu-system-alpha.

So as mentioned in the other thread, I think this patch is exactly right.

However, the need for this part

> @@ -785,6 +785,7 @@ ret_from_kernel_thread:
>  	mov	$9, $27
>  	mov	$10, $16
>  	jsr	$26, ($9)
> +	lda	$sp, SWITCH_STACK_SIZE($sp)
>  	br	$31, ret_to_user
>  .end ret_from_kernel_thread

obviously eluded me in my "how about something like this", and I had to really try to figure out why we'd ever return.

Which is why I came to that "oooh - kernel_execve()" realization.

It might be good to comment on that somewhere. And if you can think of some other case, that should be mentioned too.

Anyway, thanks for looking into this odd case. And if you have a test-case for this all, it really would be a good thing. Yes, it should only affect a couple of odd-ball architectures, but still... It would also be good to hear that you actually did verify the behavior of this patch wrt that ptrace-of-io-worker-threads case..

Linus
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads 2021-06-15 22:02 ` Linus Torvalds @ 2021-06-16 16:32 ` Eric W. Biederman 2021-06-16 18:29 ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-16 16:32 UTC (permalink / raw) To: Linus Torvalds Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, Jun 15, 2021 at 12:36 PM Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> I looked and there nothing I can do that is not arch specific, so >> whack the moles with a minimal backportable fix. >> >> This change survives boot testing on qemu-system-alpha. > > So as mentioned in the other thread, I think this patch is exactly right. > > However, the need for this part > >> @@ -785,6 +785,7 @@ ret_from_kernel_thread: >> mov $9, $27 >> mov $10, $16 >> jsr $26, ($9) >> + lda $sp, SWITCH_STACK_SIZE($sp) >> br $31, ret_to_user >> .end ret_from_kernel_thread > > obviously eluded me in my "how about something like this", and I had > to really try to figure out why we'd ever return. > > Which is why I came to that "oooh - kernel_execve()" realization. > > It might be good to comment on that somewhere. And if you can think of > some other case, that should be mentioned too. > > Anyway, thanks for looking into this odd case. And if you have a > test-case for this all, it really would be a good thing. Yes, it > should only affect a couple of odd-ball architectures, but still... It > would also be good to hear that you actually did verify the behavior > of this patch wrt that ptrace-of-io-worker-threads case.. 
*Grumble*

So just going through and looking to see what it takes to instrument and put in warnings when things go wrong, I have found another issue.

Today there exist:

	PTRACE_EVENT_FORK
	PTRACE_EVENT_VFORK
	PTRACE_EVENT_CLONE

which happen after the actual fork operation in the kernel.

The following code wraps those operations in arch/alpha/kernel/entry.S:

.macro	fork_like name
	.align	4
	.globl	alpha_\name
	.ent	alpha_\name
alpha_\name:
	.prologue 0
	bsr	$1, do_switch_stack
	jsr	$26, sys_\name
	ldq	$26, 56($sp)
	lda	$sp, SWITCH_STACK_SIZE($sp)
	ret
.end	alpha_\name
.endm

The kernel code in fork.c calls ptrace_event_pid, which ultimately calls ptrace_stop. So userspace can reasonably expect to stop the process and change its registers.

Because the macro pops the switch stack unconditionally and reloads only $26 from it, any modifications made to the other registers in that frame are lost.

So I will update my changes to handle that case as well.

Eric
^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 0/2] alpha/ptrace: Improved switch_stack handling 2021-06-16 16:32 ` Eric W. Biederman @ 2021-06-16 18:29 ` Eric W. Biederman 2021-06-16 18:31 ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman 2021-06-16 18:32 ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman 0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 18:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

This pair of changes has not received anything beyond build and boot testing. I am posting these changes as they do a much better job of warning of problems and shutting down the security hole, which makes them a much better pattern than my last patch.

I hope to get the test cases soon.

 arch/alpha/include/asm/thread_info.h   |  2 ++
 arch/alpha/kernel/entry.S              | 62 ++++++++++++++++++++++++++--------
 arch/alpha/kernel/process.c            |  3 ++
 arch/alpha/kernel/ptrace.c             | 13 +++++--
 arch/alpha/kernel/syscalls/syscall.tbl |  8 ++---
 5 files changed, 67 insertions(+), 21 deletions(-)

Eric W. Biederman (2):
      alpha/ptrace: Record and handle the absence of switch_stack
      alpha/ptrace: Add missing switch_stack frames

^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 18:29 ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman @ 2021-06-16 18:31 ` Eric W. Biederman 2021-06-16 20:00 ` Linus Torvalds ` (2 more replies) 2021-06-16 18:32 ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman 1 sibling, 3 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 18:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

While thinking about the information leaks fixed in 77f6ab8b7768 ("don't dump the threads that had been already exiting when zapped.") I realized the problem is much more general than just coredumps and exit_mm. We have io_uring threads, PTRACE_EVENT_FORK, PTRACE_EVENT_VFORK, PTRACE_EVENT_CLONE, PTRACE_EVENT_EXEC and PTRACE_EVENT_EXIT where ptrace is allowed to access userspace registers, but on some architectures has not saved them so they can be modified.

The function alpha_switch_to does something reasonable: it saves the floating point registers and the caller saved registers and switches to a different thread. Any register the caller is not expected to save it does not save.

Meanwhile the system call entry point on alpha also does something reasonable. The system call entry point saves all but the caller saved integer registers and doesn't touch the floating point registers as the kernel code does not touch them.

This is a nice happy fast path until the kernel wants to access the user space's registers through ptrace or similar.
As user space's caller saved registers may be saved at an unpredictable point in the kernel code's stack, the routine which may stop and make the userspace registers available must be wrapped by code that will first save a switch stack frame at the bottom of the call stack, call the code that may access those registers and then pop the switch stack frame.

The practical problem with this code structure is that this results in a game of whack-a-mole wrapping different kernel system calls. Losing the game of whack-a-mole results in a security hole where userspace can write arbitrary data to the kernel stack.

In general it is not possible to prevent generic code from introducing a ptrace_stop or a register access without knowing alpha's limitation, namely that alpha does not always make all of the registers available.

Prevent security holes by recording when all of the registers are available so generic code changes do not result in security holes on alpha.

Cc: stable@vger.kernel.org
Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread")
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W.
Biederman" <ebiederm@xmission.com> --- arch/alpha/include/asm/thread_info.h | 2 ++ arch/alpha/kernel/entry.S | 38 ++++++++++++++++++++++------ arch/alpha/kernel/ptrace.c | 13 ++++++++-- 3 files changed, 43 insertions(+), 10 deletions(-) diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h index 2592356e3215..41e5986ed9c8 100644 --- a/arch/alpha/include/asm/thread_info.h +++ b/arch/alpha/include/asm/thread_info.h @@ -63,6 +63,7 @@ register struct thread_info *__current_thread_info __asm__("$8"); #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SYSCALL_AUDIT 4 /* syscall audit active */ #define TIF_NOTIFY_SIGNAL 5 /* signal notifications exist */ +#define TIF_ALLREGS_SAVED 6 /* both pt_regs and switch_stack saved */ #define TIF_DIE_IF_KERNEL 9 /* dik recursion lock */ #define TIF_MEMDIE 13 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 14 /* idle is polling for TIF_NEED_RESCHED */ @@ -73,6 +74,7 @@ register struct thread_info *__current_thread_info __asm__("$8"); #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME) #define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT) #define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL) +#define _TIF_ALLREGS_SAVED (1<<TIF_ALLREGS_SAVED) #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) /* Work to do on interrupt/exception return. 
*/ diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S index e227f3a29a43..c1edf54dc035 100644 --- a/arch/alpha/kernel/entry.S +++ b/arch/alpha/kernel/entry.S @@ -174,6 +174,28 @@ .cfi_adjust_cfa_offset -SWITCH_STACK_SIZE .endm +.macro SAVE_SWITCH_STACK + DO_SWITCH_STACK +1: ldl_l $1, TI_FLAGS($8) + bis $1, _TIF_ALLREGS_SAVED, $1 + stl_c $1, TI_FLAGS($8) + beq $1, 2f +.subsection 2 +2: br 1b +.previous +.endm + +.macro RESTORE_SWITCH_STACK +1: ldl_l $1, TI_FLAGS($8) + bic $1, _TIF_ALLREGS_SAVED, $1 + stl_c $1, TI_FLAGS($8) + beq $1, 2f +.subsection 2 +2: br 1b +.previous + UNDO_SWITCH_STACK +.endm + /* * Non-syscall kernel entry points. */ @@ -559,9 +581,9 @@ $work_resched: $work_notifysig: mov $sp, $16 - DO_SWITCH_STACK + SAVE_SWITCH_STACK jsr $26, do_work_pending - UNDO_SWITCH_STACK + RESTORE_SWITCH_STACK br restore_all /* @@ -572,9 +594,9 @@ $work_notifysig: .type strace, @function strace: /* set up signal stack, call syscall_trace */ - DO_SWITCH_STACK + SAVE_SWITCH_STACK jsr $26, syscall_trace_enter /* returns the syscall number */ - UNDO_SWITCH_STACK + RESTORE_SWITCH_STACK /* get the arguments back.. 
*/ ldq $16, SP_OFF+24($sp) @@ -602,9 +624,9 @@ ret_from_straced: $strace_success: stq $0, 0($sp) /* save return value */ - DO_SWITCH_STACK + SAVE_SWITCH_STACK jsr $26, syscall_trace_leave - UNDO_SWITCH_STACK + RESTORE_SWITCH_STACK br $31, ret_from_sys_call .align 3 @@ -618,13 +640,13 @@ $strace_error: stq $0, 0($sp) stq $1, 72($sp) /* a3 for return */ - DO_SWITCH_STACK + SAVE_SWITCH_STACK mov $18, $9 /* save old syscall number */ mov $19, $10 /* save old a3 */ jsr $26, syscall_trace_leave mov $9, $18 mov $10, $19 - UNDO_SWITCH_STACK + RESTORE_SWITCH_STACK mov $31, $26 /* tell "ret_from_sys_call" we can restart */ br ret_from_sys_call diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c index 8c43212ae38e..41fb994f36dc 100644 --- a/arch/alpha/kernel/ptrace.c +++ b/arch/alpha/kernel/ptrace.c @@ -117,7 +117,13 @@ get_reg_addr(struct task_struct * task, unsigned long regno) zero = 0; addr = &zero; } else { - addr = task_stack_page(task) + regoff[regno]; + int off = regoff[regno]; + if (WARN_ON_ONCE((off < PT_REG(r0)) && + !test_ti_thread_flag(task_thread_info(task), + TIF_ALLREGS_SAVED))) + addr = &zero; + else + addr = task_stack_page(task) + off; } return addr; } @@ -145,13 +151,16 @@ get_reg(struct task_struct * task, unsigned long regno) static int put_reg(struct task_struct *task, unsigned long regno, unsigned long data) { + unsigned long *addr; if (regno == 63) { task_thread_info(task)->ieee_state = ((task_thread_info(task)->ieee_state & ~IEEE_SW_MASK) | (data & IEEE_SW_MASK)); data = (data & FPCR_DYN_MASK) | ieee_swcr_to_fpcr(data); } - *get_reg_addr(task, regno) = data; + addr = get_reg_addr(task, regno); + if (addr != &zero) + *addr = data; return 0; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 18:31 ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman @ 2021-06-16 20:00 ` Linus Torvalds 2021-06-16 20:37 ` Linus Torvalds 2021-06-16 20:42 ` Eric W. Biederman 2021-06-16 20:17 ` Al Viro 2021-06-21 2:01 ` Michael Schmitz 2 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-16 20:00 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 11:32 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Prevent security holes by recording when all of the registers are
> available so generic code changes do not result in security holes
> on alpha.

Please no, not this way. ldl/stc is extremely expensive on some alpha cpus.

I really think that TIF_ALLREGS_SAVED bit isn't worth it, except perhaps for debugging.

And even for debugging, I think it would be both easier and cheaper to just add a magic word to the entry stack instead.

Linus
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 20:00 ` Linus Torvalds @ 2021-06-16 20:37 ` Linus Torvalds 2021-06-16 20:57 ` Eric W. Biederman 2021-06-16 20:42 ` Eric W. Biederman 1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-16 20:37 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 1:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And even for debugging, I think it would be both easier and cheaper to
> just add a magic word to the entry stack instead.

IOW, just add a

	unsigned long magic;

to "struct switch_stack", and then make the stack switch code push that value.

That would be cheap enough to be just unconditional, but you could make it depend on a debug config option too, of course.

It helps if 'xyz' is some constant that is easyish to generate. It might not be a constant - maybe it could be the address of that 'magic' field itself, so you'd just generate it with

	stq $r,($r)

and it would be equally easy to just validate at lookup for that WARN_ON_ONCE():

	WARN_ON_ONCE(switch_stack->magic != (unsigned long)&switch_stack->magic);

or whatever.

It's for debugging, not security. So it doesn't have to be some kind of super-great magic number, just something easy to generate and check (that isn't a common value like "0" that trivially exist on the stack anyway).

Linus
^ permalink raw reply [flat|nested] 119+ messages in thread
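Editorial sketch of the self-addressed magic word suggested above. This is a user-space illustration only, not kernel code: the struct and the `sketch_*` helpers are hypothetical names, the struct layout is only loosely modelled on alpha's `struct switch_stack`, and `assert` stands in for a kernel warning such as WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct switch_stack plus the proposed field. */
struct switch_stack_sketch {
	uint64_t callee_saved[8];	/* placeholder for r9..r15 and r26 */
	uint64_t fp[32];		/* placeholder for the FP registers */
	uintptr_t magic;		/* proposed debugging word */
};

/* Save path: store the field's own address into the field. */
static void sketch_mark_saved(struct switch_stack_sketch *ss)
{
	assert(ss != NULL);
	ss->magic = (uintptr_t)&ss->magic;
}

/* Lookup path: the frame is trusted only if the word points at itself. */
static bool sketch_frame_is_valid(const struct switch_stack_sketch *ss)
{
	assert(ss != NULL);
	return ss->magic == (uintptr_t)&ss->magic;
}
```

A frame that was never written by the save path (for example one that is still all zeroes) fails the check, while a freshly saved frame passes. What should happen to the word when the frame is popped is not covered by the email and is left out of the sketch.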
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 20:37 ` Linus Torvalds @ 2021-06-16 20:57 ` Eric W. Biederman 2021-06-16 21:02 ` Al Viro 2021-06-16 21:08 ` Linus Torvalds 0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 16, 2021 at 1:00 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And even for debugging, I think it would be both easier and cheaper to
>> just add a magic word to the entry stack instead.
>
> IOW, just add a
>
>	unsigned long magic;
>
> to "struct switch_stack", and then make the stack switch code push that value.
>
> That would be cheap enough to be just unconditional, but you could
> make it depend on a debug config option too, of course.
>
> It helps if 'xyz' is some constant that is easyish to generate. It
> might not be a constant - maybe it could be the address of that
> 'magic' field itself, so you'd just generate it with
>
>	stq $r,($r)
>
> and it would be equally easy to just validate at lookup for that WARN_ON_ONCE():
>
>	WARN_ON_ONCE(switch_stack->magic != (unsigned long)&switch_stack->magic);
>
> or whatever.
>
> It's for debugging, not security. So it doesn't have to be some kind
> of super-great magic number, just something easy to generate and check
> (that isn't a common value like "0" that trivially exist on the stack
> anyway).

Fair enough.

I was thinking for a moment that do_sigreturn might have a problem with that but restore_sigcontext makes it clear that struct switch_stack is not exposed to userspace.

Do you know if struct switch_stack or pt_regs is ever exposed to userspace?
They are both defined in arch/alpha/include/uapi/asm/ptrace.h which makes me think userspace must see those definitions somewhere. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 20:57 ` Eric W. Biederman @ 2021-06-16 21:02 ` Al Viro 2021-06-16 21:08 ` Linus Torvalds 1 sibling, 0 replies; 119+ messages in thread From: Al Viro @ 2021-06-16 21:02 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Wed, Jun 16, 2021 at 03:57:19PM -0500, Eric W. Biederman wrote: > Do you know if struct switch_stack or pt_regs is ever exposeed to > usespace? They are both defined in arch/alpha/include/uapi/asm/ptrace.h > which makes me think userspace must see those definitions somewhere. They are exposed, but why mess with those in the first place? thread_info->status is strictly thread-synchronous, so just use it and be done with that... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 20:57 ` Eric W. Biederman 2021-06-16 21:02 ` Al Viro @ 2021-06-16 21:08 ` Linus Torvalds 1 sibling, 0 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-16 21:08 UTC (permalink / raw) To: Eric W. Biederman Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Wed, Jun 16, 2021 at 1:57 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Do you know if struct switch_stack or pt_regs is ever exposeed to > usespace? They are both defined in arch/alpha/include/uapi/asm/ptrace.h > which makes me think userspace must see those definitions somewhere. Yeah, that uapi location is a bit unfortunate. It means that user space _could_ have seen it. Which probably means that some user space uses it. Not for any kernel interfaces (the alpha ptrace register offsets are actually sane, and we have that "regoff[]" array to find them) - but I could see some odd program having decided to use the kernel pt_regs and switch_stack structures for their own reasons. Annoying. Because we don't really expose it as-is in any way, afaik. Only incidentally - and by mistake - in a uapi header file. Maybe a flag in thread_info->status (or even a new 32-bit field entirely in thread_info) is the way to go like Al says. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 20:00 ` Linus Torvalds 2021-06-16 20:37 ` Linus Torvalds @ 2021-06-16 20:42 ` Eric W. Biederman 1 sibling, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 16, 2021 at 11:32 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Prevent security holes by recording when all of the registers are
>> available so generic code changes do not result in security holes
>> on alpha.
>
> Please no, not this way. ldl/stc is extremely expensive on some alpha cpus.
>
> I really think thatTIF_ALLREGS_SAVED bit isn't worth it, except
> perhaps for debugging.
>
> And even for debugging, I think it would be both easier and cheaper to
> just add a magic word to the entry stack instead.

I think I can do something like that.

Looking at arch/alpha/include/asm/cache.h it looks like alpha had either 32-byte or 64-byte cachelines. Which makes struct switch_stack a full 10 or 5 cachelines in size. So pushing something extra might hit an extra cacheline.

However it looks like struct pt_regs is 16 bytes short of a full cache line so struct switch_stack isn't going to be cacheline aligned. Adding an extra 8 bytes of magic number will hopefully be in the noise.

If I can I would like to find something that is cheap enough that I can always leave on. Mostly because there is little enough testing that a bug that allows anyone to stomp the kernel stack has existed for 17 years without being noticed.

If you want it to be a debug option only I can certainly make that happen.

I am still going "Eek! Arbitrary stack smash!" in my head.
Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
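Editorial sketch of the cacheline arithmetic in the message above. The struct below is an assumed mirror of alpha's `struct switch_stack` as it looked at the time (eight integer registers plus 32 floating point slots, 8 bytes each); the names `switch_stack_layout` and `cachelines_spanned` are hypothetical, and the helper assumes the object starts on a cacheline boundary, which the message notes is not actually the case on the kernel stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: r9..r15, r26, then 32 floating point slots. */
struct switch_stack_layout {
	uint64_t r9, r10, r11, r12, r13, r14, r15;
	uint64_t r26;
	uint64_t fp[32];
};

/* Number of cachelines touched by `bytes` bytes starting at a line boundary. */
static size_t cachelines_spanned(size_t bytes, size_t line_size)
{
	assert(line_size != 0);
	return (bytes + line_size - 1) / line_size;
}
```

Under these assumptions the frame is 40 * 8 = 320 bytes, i.e. exactly 10 lines of 32 bytes or 5 lines of 64 bytes, and one extra 8-byte word (328 bytes) would spill into an 11th or 6th line if the frame were aligned. With an unaligned frame the real count can differ by one either way.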
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 18:31 ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman 2021-06-16 20:00 ` Linus Torvalds @ 2021-06-16 20:17 ` Al Viro 2021-06-21 2:01 ` Michael Schmitz 2 siblings, 0 replies; 119+ messages in thread From: Al Viro @ 2021-06-16 20:17 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Wed, Jun 16, 2021 at 01:31:52PM -0500, Eric W. Biederman wrote: > +.macro SAVE_SWITCH_STACK > + DO_SWITCH_STACK > +1: ldl_l $1, TI_FLAGS($8) > + bis $1, _TIF_ALLREGS_SAVED, $1 > + stl_c $1, TI_FLAGS($8) > + beq $1, 2f > +.subsection 2 > +2: br 1b > +.previous > +.endm What the hell? *IF* you are going to go that way, at least put it into ->status, not ->flag - those are thread-synchronous and do not require that kind of masturbation. > +.macro RESTORE_SWITCH_STACK > +1: ldl_l $1, TI_FLAGS($8) > + bic $1, _TIF_ALLREGS_SAVED, $1 > + stl_c $1, TI_FLAGS($8) > + beq $1, 2f > +.subsection 2 > +2: br 1b > +.previous > + UNDO_SWITCH_STACK > +.endm Ditto. What do you need that flag for, anyway? > @@ -117,7 +117,13 @@ get_reg_addr(struct task_struct * task, unsigned long regno) > zero = 0; > addr = &zero; > } else { > - addr = task_stack_page(task) + regoff[regno]; > + int off = regoff[regno]; > + if (WARN_ON_ONCE((off < PT_REG(r0)) && > + !test_ti_thread_flag(task_thread_info(task), > + TIF_ALLREGS_SAVED))) > + addr = &zero; > + else > + addr = task_stack_page(task) + off; A sanity check in slow path, buggering the hell out of a fast path? ^ permalink raw reply [flat|nested] 119+ messages in thread
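Editorial sketch of the distinction drawn above between ->flags and ->status. It is a user-space analogy using C11 atomics, not the kernel's implementation: `thread_info_sketch`, the `sketch_*` helpers and the SKETCH_TS_ALLREGS_SAVED bit are hypothetical, and only the bit number 6 for the TIF flag is taken from the patch. A word that other CPUs may modify concurrently needs an atomic read-modify-write (on alpha, the ldl_l/stl_c retry loop in the patch), while a word that only the owning thread ever touches can be updated with a plain load, OR and store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_TIF_ALLREGS_SAVED (1u << 6)	/* bit number used in the patch */
#define SKETCH_TS_ALLREGS_SAVED  (1u << 0)	/* hypothetical status bit */

struct thread_info_sketch {
	atomic_uint flags;	/* may be modified by other threads/CPUs */
	unsigned int status;	/* only modified by the owning thread */
};

/* Shared word: needs an atomic read-modify-write. */
static void sketch_set_flag_atomic(struct thread_info_sketch *ti)
{
	assert(ti != NULL);
	atomic_fetch_or(&ti->flags, SKETCH_TIF_ALLREGS_SAVED);
}

static void sketch_clear_flag_atomic(struct thread_info_sketch *ti)
{
	assert(ti != NULL);
	atomic_fetch_and(&ti->flags, ~SKETCH_TIF_ALLREGS_SAVED);
}

/* Thread-synchronous word: a plain update is enough. */
static void sketch_set_status(struct thread_info_sketch *ti)
{
	assert(ti != NULL);
	ti->status |= SKETCH_TS_ALLREGS_SAVED;
}

static void sketch_clear_status(struct thread_info_sketch *ti)
{
	assert(ti != NULL);
	ti->status &= ~SKETCH_TS_ALLREGS_SAVED;
}
```

Both pairs of helpers leave the same bit pattern behind; the difference is only in the cost of the update, which is the objection raised against doing it on the syscall tracing and signal paths.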
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack 2021-06-16 18:31 ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman 2021-06-16 20:00 ` Linus Torvalds 2021-06-16 20:17 ` Al Viro @ 2021-06-21 2:01 ` Michael Schmitz 2021-06-21 2:17 ` Linus Torvalds 2021-06-21 2:27 ` Al Viro 2 siblings, 2 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21 2:01 UTC (permalink / raw)
To: Eric W. Biederman, Linus Torvalds
Cc: linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Eric,

instrumenting get_reg on m68k and using a similar patch to yours to warn when unsaved registers are accessed on the switch stack, I get a hit from getegid and getegid32, just by running a simple ptrace on ls.

Going to whack those two moles now ...

Cheers,

Michael

On 17/06/21 6:31 am, Eric W. Biederman wrote:
> While thinking about the information leaks fixed in 77f6ab8b7768
> ("don't dump the threads that had been already exiting when zapped.")
> I realized the problem is much more general than just coredumps and
> exit_mm. We have io_uring threads, PTRACE_EVENT_FORK,
> PTRACE_EVENT_VFORK, PTRACE_EVENT_CLONE, PTRACE_EVENT_EXEC and
> PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
> registers, but on some architectures has not saved them so
> they can be modified.
>
> The function alpha_switch_to does something reasonable it saves the
> floating point registers and the caller saved registers and switches
> to a different thread. Any register the caller is not expected to
> save it does not save.
>
> Meanhile the system call entry point on alpha also does something
> reasonable. The system call entry point saves all but the caller
> saved integer registers and doesn't touch the floating point registers
> as the kernel code does not touch them.
> > This is a nice happy fast path until the kernel wants to access the > user space's registers through ptrace or similar. As user spaces's > caller saved registers may be saved at an unpredictable point in the > kernel code's stack, the routine which may stop and make the userspace > registers available must be wrapped by code that will first save a > switch stack frame at the bottom of the call stack, call the code that > may access those registers and then pop the switch stack frame. > > The practical problem with this code structure is that this results in > a game of whack-a-mole wrapping different kernel system calls. Loosing > the game of whack-a-mole results in a security hole where userspace can > write arbitrary data to the kernel stack. > > In general it is not possible to prevent generic code introducing a > ptrace_stop or register access not knowing alpha's limitations, that > where alpha does not make all of the registers avaliable. > > Prevent security holes by recording when all of the registers are > available so generic code changes do not result in security holes > on alpha. > > Cc: stable@vger.kernel.org > Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread") > Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") > Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism") > History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git > Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> > --- > arch/alpha/include/asm/thread_info.h | 2 ++ > arch/alpha/kernel/entry.S | 38 ++++++++++++++++++++++------ > arch/alpha/kernel/ptrace.c | 13 ++++++++-- > 3 files changed, 43 insertions(+), 10 deletions(-) > > diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h > index 2592356e3215..41e5986ed9c8 100644 > --- a/arch/alpha/include/asm/thread_info.h > +++ b/arch/alpha/include/asm/thread_info.h > @@ -63,6 +63,7 @@ register struct thread_info *__current_thread_info __asm__("$8"); > #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ > #define TIF_SYSCALL_AUDIT 4 /* syscall audit active */ > #define TIF_NOTIFY_SIGNAL 5 /* signal notifications exist */ > +#define TIF_ALLREGS_SAVED 6 /* both pt_regs and switch_stack saved */ > #define TIF_DIE_IF_KERNEL 9 /* dik recursion lock */ > #define TIF_MEMDIE 13 /* is terminating due to OOM killer */ > #define TIF_POLLING_NRFLAG 14 /* idle is polling for TIF_NEED_RESCHED */ > @@ -73,6 +74,7 @@ register struct thread_info *__current_thread_info __asm__("$8"); > #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME) > #define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT) > #define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL) > +#define _TIF_ALLREGS_SAVED (1<<TIF_ALLREGS_SAVED) > #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) > > /* Work to do on interrupt/exception return. 
*/ > diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S > index e227f3a29a43..c1edf54dc035 100644 > --- a/arch/alpha/kernel/entry.S > +++ b/arch/alpha/kernel/entry.S > @@ -174,6 +174,28 @@ > .cfi_adjust_cfa_offset -SWITCH_STACK_SIZE > .endm > > +.macro SAVE_SWITCH_STACK > + DO_SWITCH_STACK > +1: ldl_l $1, TI_FLAGS($8) > + bis $1, _TIF_ALLREGS_SAVED, $1 > + stl_c $1, TI_FLAGS($8) > + beq $1, 2f > +.subsection 2 > +2: br 1b > +.previous > +.endm > + > +.macro RESTORE_SWITCH_STACK > +1: ldl_l $1, TI_FLAGS($8) > + bic $1, _TIF_ALLREGS_SAVED, $1 > + stl_c $1, TI_FLAGS($8) > + beq $1, 2f > +.subsection 2 > +2: br 1b > +.previous > + UNDO_SWITCH_STACK > +.endm > + > /* > * Non-syscall kernel entry points. > */ > @@ -559,9 +581,9 @@ $work_resched: > > $work_notifysig: > mov $sp, $16 > - DO_SWITCH_STACK > + SAVE_SWITCH_STACK > jsr $26, do_work_pending > - UNDO_SWITCH_STACK > + RESTORE_SWITCH_STACK > br restore_all > > /* > @@ -572,9 +594,9 @@ $work_notifysig: > .type strace, @function > strace: > /* set up signal stack, call syscall_trace */ > - DO_SWITCH_STACK > + SAVE_SWITCH_STACK > jsr $26, syscall_trace_enter /* returns the syscall number */ > - UNDO_SWITCH_STACK > + RESTORE_SWITCH_STACK > > /* get the arguments back.. 
*/ > ldq $16, SP_OFF+24($sp) > @@ -602,9 +624,9 @@ ret_from_straced: > $strace_success: > stq $0, 0($sp) /* save return value */ > > - DO_SWITCH_STACK > + SAVE_SWITCH_STACK > jsr $26, syscall_trace_leave > - UNDO_SWITCH_STACK > + RESTORE_SWITCH_STACK > br $31, ret_from_sys_call > > .align 3 > @@ -618,13 +640,13 @@ $strace_error: > stq $0, 0($sp) > stq $1, 72($sp) /* a3 for return */ > > - DO_SWITCH_STACK > + SAVE_SWITCH_STACK > mov $18, $9 /* save old syscall number */ > mov $19, $10 /* save old a3 */ > jsr $26, syscall_trace_leave > mov $9, $18 > mov $10, $19 > - UNDO_SWITCH_STACK > + RESTORE_SWITCH_STACK > > mov $31, $26 /* tell "ret_from_sys_call" we can restart */ > br ret_from_sys_call > diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c > index 8c43212ae38e..41fb994f36dc 100644 > --- a/arch/alpha/kernel/ptrace.c > +++ b/arch/alpha/kernel/ptrace.c > @@ -117,7 +117,13 @@ get_reg_addr(struct task_struct * task, unsigned long regno) > zero = 0; > addr = &zero; > } else { > - addr = task_stack_page(task) + regoff[regno]; > + int off = regoff[regno]; > + if (WARN_ON_ONCE((off < PT_REG(r0)) && > + !test_ti_thread_flag(task_thread_info(task), > + TIF_ALLREGS_SAVED))) > + addr = &zero; > + else > + addr = task_stack_page(task) + off; > } > return addr; > } > @@ -145,13 +151,16 @@ get_reg(struct task_struct * task, unsigned long regno) > static int > put_reg(struct task_struct *task, unsigned long regno, unsigned long data) > { > + unsigned long *addr; > if (regno == 63) { > task_thread_info(task)->ieee_state > = ((task_thread_info(task)->ieee_state & ~IEEE_SW_MASK) > | (data & IEEE_SW_MASK)); > data = (data & FPCR_DYN_MASK) | ieee_swcr_to_fpcr(data); > } > - *get_reg_addr(task, regno) = data; > + addr = get_reg_addr(task, regno); > + if (addr != &zero) > + *addr = data; > return 0; > } > ^ permalink raw reply [flat|nested] 119+ messages in thread
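Editorial sketch, not part of the thread: the ptrace.c hunk quoted above can be modelled in plain user-space C. All names here (model_task, model_get_reg, model_put_reg, NR_SWITCH_STACK) are hypothetical stand-ins, and the register layout is invented for illustration; only the shape of the check (return a dummy slot when the switch_stack part was never saved, and drop writes to it) follows the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: slots below NR_SWITCH_STACK model registers that
 * live in switch_stack, the rest model registers that live in pt_regs. */
enum { NR_SWITCH_STACK = 4, NR_REGS = 8 };

struct model_task {
	bool allregs_saved;           /* stands in for TIF_ALLREGS_SAVED */
	unsigned long regs[NR_REGS];  /* stands in for the kernel stack page */
};

/* Shared dummy slot, playing the role of "zero" in the quoted patch. */
unsigned long model_zero;

static unsigned long *model_reg_addr(struct model_task *t, size_t regno)
{
	assert(regno < NR_REGS);
	/* A switch_stack register is only addressable if it was saved. */
	if (regno < NR_SWITCH_STACK && !t->allregs_saved)
		return &model_zero;
	return &t->regs[regno];
}

unsigned long model_get_reg(struct model_task *t, size_t regno)
{
	return *model_reg_addr(t, regno);
}

void model_put_reg(struct model_task *t, size_t regno, unsigned long data)
{
	unsigned long *addr = model_reg_addr(t, regno);

	/* Writes aimed at the dummy slot are silently dropped. */
	if (addr != &model_zero)
		*addr = data;
}
```

With the flag clear, reads of the first four slots return 0 and writes to them have no effect; registers in the pt_regs part stay accessible either way.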
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 2:01 ` Michael Schmitz
@ 2021-06-21 2:17 ` Linus Torvalds
  2021-06-21 3:18 ` Michael Schmitz
  2021-06-21 2:27 ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 2:17 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Sun, Jun 20, 2021 at 7:01 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> instrumenting get_reg on m68k and using a similar patch to yours to warn
> when unsaved registers are accessed on the switch stack, I get a hit
> from getegid and getegid32, just by running a simple ptrace on ls.
>
> Going to wack those two moles now ...

I don't see what's going on. Those system calls don't use the register
state, afaik. What's the call chain, exactly?

Linus

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 2:17 ` Linus Torvalds
@ 2021-06-21 3:18 ` Michael Schmitz
  2021-06-21 3:37 ` Linus Torvalds
  2021-06-21 3:44 ` Al Viro
  0 siblings, 2 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21 3:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Linus,

On 21.06.2021 at 14:17, Linus Torvalds wrote:
> On Sun, Jun 20, 2021 at 7:01 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>>
>> instrumenting get_reg on m68k and using a similar patch to yours to warn
>> when unsaved registers are accessed on the switch stack, I get a hit
>> from getegid and getegid32, just by running a simple ptrace on ls.
>>
>> Going to wack those two moles now ...
>
> I don't see what's going on. Those system calls don't use the register
> state, afaik. What's the call chain, exactly?

This is what I get from WARN_ONCE:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 1177 at arch/m68k/kernel/ptrace.c:91 get_reg+0x90/0xb8
Modules linked in:
CPU: 0 PID: 1177 Comm: strace Not tainted 5.13.0-rc1-atari-fpuemu-exitfix+ #1146
Stack from 014b7f04:
        014b7f04 00336401 00336401 000278f0 0032c015 0000005b 00000005 0002795a
        0032c015 0000005b 0000338c 00000009 00000000 00000000 ffffffe4 00000005
        00000003 00000014 00000003 00000014 efc2b90c 0000338c 0032c015 0000005b
        00000009 00000000 efc2b908 00912540 efc2b908 000034cc 00912540 00000005
        00000000 efc2b908 00000003 00912540 8000110c c010b0a4 efc2b90c 0002d1d8
        00912540 00000003 00000014 efc2b908 0000049a 00000014 efc2b908 800acaa8
Call Trace: [<000278f0>] __warn+0x9e/0xb4
 [<0002795a>] warn_slowpath_fmt+0x54/0x62
 [<0000338c>] get_reg+0x90/0xb8
 [<0000338c>] get_reg+0x90/0xb8
 [<000034cc>] arch_ptrace+0x7e/0x250
 [<0002d1d8>] sys_ptrace+0x232/0x2f8
 [<00002ab6>] syscall+0x8/0xc
 [<0000c00b>] lower+0x7/0x20

---[ end trace ee4be53b94695793 ]---

Syscall numbers are actually 90 and 192 - sys_old_mmap and sys_mmap2 on
m68k. I used the calculator on my Ubuntu desktop, which appears to be a
little confused about hex to decimal conversions.

I hope that makes more sense?

Cheers,

Michael

>
> Linus
>

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 3:18 ` Michael Schmitz
@ 2021-06-21 3:37 ` Linus Torvalds
  2021-06-21 4:08 ` Michael Schmitz
  2021-06-21 3:44 ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-21 3:37 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Sun, Jun 20, 2021 at 8:18 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> I hope that makes more sense?

So the problem is in your debug patch: you don't set that
TIS_SWITCH_STACK in nearly enough places.

In this particular example, I think it's that you don't set it in
do_trace_exit, so when you strace the process, the system call exit -
which is where the return value will be picked up - gets that warning.

You did set TIS_SWITCH_STACK on trace_entry, but then it's cleared
again during the system call, and not set at the trace_exit path.
Oddly, your debug patch also _clears_ it on the exit path, but it
doesn't set it when do_trace_exit does the SAVE_SWITCH_STACK.

You oddly also set it for __sys_exit, but not all the other special
system calls that also do that SAVE_SWITCH_STACK.

Really, pretty much every single case of SAVE_SWITCH_STACK would need
to set it. Not just do_trace_enter/exit

It's why I didn't like Eric's debug patch either. It's quite expensive
to do, partly because you look up that curptr thing. All very nasty.

It would be *much* better to make the flag be part of the stack frame,
but sadly at least on alpha we had exported the format of that stack
frame to user space.

Anyway, I think these debug patches are not just expensive but the
m68k one most definitely is also very incomplete.

Linus

^ permalink raw reply [flat|nested] 119+ messages in thread
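Editorial sketch, not part of the thread: the alpha SAVE_SWITCH_STACK / RESTORE_SWITCH_STACK macros in Eric's patch update the thread flags with a load-locked/store-conditional retry loop (ldl_l, stl_c, branch back on failure). A rough user-space analogue, assuming C11 atomics and hypothetical names (model_save_switch_stack and friends), is a compare-and-swap loop that sets the bit on every save and clears it on every restore, which is the pairing Linus asks for above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit 6, matching TIF_ALLREGS_SAVED in the quoted alpha patch. */
#define MODEL_TIF_ALLREGS_SAVED (1u << 6)

/* Set the flag: reload, OR in the bit, retry until the store succeeds. */
void model_save_switch_stack(atomic_uint *flags)
{
	unsigned int old = atomic_load(flags);

	/* On failure the CAS refreshes "old", so the loop recomputes. */
	while (!atomic_compare_exchange_weak(flags, &old,
					     old | MODEL_TIF_ALLREGS_SAVED))
		;
}

/* Clear the flag with the same retry pattern. */
void model_restore_switch_stack(atomic_uint *flags)
{
	unsigned int old = atomic_load(flags);

	while (!atomic_compare_exchange_weak(flags, &old,
					     old & ~MODEL_TIF_ALLREGS_SAVED))
		;
}

bool model_allregs_saved(atomic_uint *flags)
{
	return (atomic_load(flags) & MODEL_TIF_ALLREGS_SAVED) != 0;
}
```

The other flag bits are untouched by either helper. The extra atomic read-modify-write on each save and restore is the kind of cost the thread objects to, so this is an illustration of the mechanism rather than a recommendation.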
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 3:37 ` Linus Torvalds
@ 2021-06-21 4:08 ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21 4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Linus,

I realized that the patch is still incomplete when answering Al...

On 21.06.2021 at 15:37, Linus Torvalds wrote:
> On Sun, Jun 20, 2021 at 8:18 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>>
>> I hope that makes more sense?
>
> So the problem is in your debug patch: you don't set that
> TIS_SWITCH_STACK in nearly enough places.
>
> In this particular example, I think it's that you don't set it in
> do_trace_exit, so when you strace the process, the system call exit -
> which is where the return value will be picked up - gets that warning.
>
> You did set TIS_SWITCH_STACK on trace_entry, but then it's cleared
> again during the system call, and not set at the trace_exit path.
> Oddly, your debug patch also _clears_ it on the exit path, but it
> doesn't set it when do_trace_exit does the SAVE_SWITCH_STACK.
>
> You oddly also set it for __sys_exit, but not all the other special
> system calls that also do that SAVE_SWITCH_STACK.

That's the one I used to test whether my debug patch had any ill side
effects (i.e. smashing the stack) late yesterday. Forgot to add that to
the other cases.

>
> Really, pretty much every single case of SAVE_SWITCH_STACK would need
> to set it. Not just do_trace_enter/exit

Yes - done that now and the warning is gone.

> It's why I didn't like Eric's debug patch either. It's quite expensive
> to do, partly because you look up that curptr thing. All very nasty.

I need to talk to Geert and Andreas to find where register a1 is
preserved, but if I have to reload a1 all the time, this won't be useful
except for debugging.

> It would be *much* better to make the flag be part of the stack frame,
> but sadly at least on alpha we had exported the format of that stack
> frame to user space.

Same on m68k, but can we push a flag _after_ the switch stack?

> Anyway, I think these debug patches are not just expensive but the
> m68k one most definitely is also very incomplete.

Yes, I've seen that in the meantime. Need to triple check my work next
time. Sorry for the extra noise!

Cheers,

Michael

>
> Linus
>

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 3:18 ` Michael Schmitz
  2021-06-21 3:37 ` Linus Torvalds
@ 2021-06-21 3:44 ` Al Viro
  2021-06-21 5:31 ` Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 3:44 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 03:18:35PM +1200, Michael Schmitz wrote:

> This is what I get from WARN_ONCE:
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1177 at arch/m68k/kernel/ptrace.c:91 get_reg+0x90/0xb8
> Modules linked in:
> CPU: 0 PID: 1177 Comm: strace Not tainted 5.13.0-rc1-atari-fpuemu-exitfix+
> #1146
> Stack from 014b7f04:
> 014b7f04 00336401 00336401 000278f0 0032c015 0000005b 00000005
> 0002795a
> 0032c015 0000005b 0000338c 00000009 00000000 00000000 ffffffe4
> 00000005
> 00000003 00000014 00000003 00000014 efc2b90c 0000338c 0032c015
> 0000005b
> 00000009 00000000 efc2b908 00912540 efc2b908 000034cc 00912540
> 00000005
> 00000000 efc2b908 00000003 00912540 8000110c c010b0a4 efc2b90c
> 0002d1d8
> 00912540 00000003 00000014 efc2b908 0000049a 00000014 efc2b908
> 800acaa8
> Call Trace: [<000278f0>] __warn+0x9e/0xb4
> [<0002795a>] warn_slowpath_fmt+0x54/0x62
> [<0000338c>] get_reg+0x90/0xb8
> [<0000338c>] get_reg+0x90/0xb8
> [<000034cc>] arch_ptrace+0x7e/0x250
> [<0002d1d8>] sys_ptrace+0x232/0x2f8
> [<00002ab6>] syscall+0x8/0xc
> [<0000c00b>] lower+0x7/0x20
>
> ---[ end trace ee4be53b94695793 ]---
>
> Syscall numbers are actually 90 and 192 - sys_old_mmap and sys_mmap2 on
> m68k. Used the calculator on my Ubuntu desktop, that appears to be a little
> confused about hex to decimal conversions.
>
> I hope that makes more sense?

Not really; what is the condition you are checking?
The interesting trace is not that with get_reg() - it's that of the
process being traced. You are not accessing the stack of caller of
ptrace(2) here, so you want to know that SAVE_SWITCH_STACK had been done
by the tracee, not tracer.

And if that had been strace ls, you have TIF_SYSCALL_TRACE set for ls, so
	* ls hits system_call
	* notices TIF_SYSCALL_TRACE and goes to do_trace_entry
	* does SAVE_SWITCH_STACK there
	* calls syscall_trace(), which calls ptrace_notify()
	* ptrace_notify() calls ptrace_do_notify(), which calls ptrace_stop()
	* ptrace_stop() arranges for tracer to be woken up and gives CPU up,
	  with TASK_TRACED as process state.

That's the callchain in ls, and switch_stack accessed by get_reg() from
strace is the one on ls(1) stack created by SAVE_SWITCH_STACK.

^ permalink raw reply [flat|nested] 119+ messages in thread
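Editorial sketch, not part of the thread: Al's point is that the state that matters belongs to the tracee, and that the tracer only looks at it while the tracee is stopped. A toy user-space model, with hypothetical names (model_tracee, model_trace_entry_stop, and so on) and none of the real scheduling involved, might look like this.

```c
#include <assert.h>
#include <stdbool.h>

enum model_state { MODEL_RUNNING, MODEL_TRACED };

/* Per-tracee state; the tracer has its own, which is never consulted. */
struct model_tracee {
	enum model_state state;
	bool switch_stack_saved;
};

/* Models do_trace_entry: SAVE_SWITCH_STACK, then stop in ptrace_stop(). */
void model_trace_entry_stop(struct model_tracee *t)
{
	t->switch_stack_saved = true;
	t->state = MODEL_TRACED;
}

/* Models a stop reached from a path whose glue saved no switch_stack. */
void model_stop_without_save(struct model_tracee *t)
{
	t->state = MODEL_TRACED;
}

/* Models the tracee resuming and its glue popping the frame again. */
void model_resume(struct model_tracee *t)
{
	t->switch_stack_saved = false;
	t->state = MODEL_RUNNING;
}

/* Tracer side: get_reg() is only safe on a stopped tracee with a frame. */
bool model_tracer_may_read_all_regs(const struct model_tracee *t)
{
	return t->state == MODEL_TRACED && t->switch_stack_saved;
}
```

The second stop helper stands for the cases this thread worries about, such as a stop reached from a plain syscall path.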
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 3:44 ` Al Viro
@ 2021-06-21 5:31 ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21 5:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Al,

On 21.06.2021 at 15:44, Al Viro wrote:
> On Mon, Jun 21, 2021 at 03:18:35PM +1200, Michael Schmitz wrote:
>
>> This is what I get from WARN_ONCE:
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1177 at arch/m68k/kernel/ptrace.c:91 get_reg+0x90/0xb8
>> Modules linked in:
>> CPU: 0 PID: 1177 Comm: strace Not tainted 5.13.0-rc1-atari-fpuemu-exitfix+
>> #1146
>> Stack from 014b7f04:
>> 014b7f04 00336401 00336401 000278f0 0032c015 0000005b 00000005
>> 0002795a
>> 0032c015 0000005b 0000338c 00000009 00000000 00000000 ffffffe4
>> 00000005
>> 00000003 00000014 00000003 00000014 efc2b90c 0000338c 0032c015
>> 0000005b
>> 00000009 00000000 efc2b908 00912540 efc2b908 000034cc 00912540
>> 00000005
>> 00000000 efc2b908 00000003 00912540 8000110c c010b0a4 efc2b90c
>> 0002d1d8
>> 00912540 00000003 00000014 efc2b908 0000049a 00000014 efc2b908
>> 800acaa8
>> Call Trace: [<000278f0>] __warn+0x9e/0xb4
>> [<0002795a>] warn_slowpath_fmt+0x54/0x62
>> [<0000338c>] get_reg+0x90/0xb8
>> [<0000338c>] get_reg+0x90/0xb8
>> [<000034cc>] arch_ptrace+0x7e/0x250
>> [<0002d1d8>] sys_ptrace+0x232/0x2f8
>> [<00002ab6>] syscall+0x8/0xc
>> [<0000c00b>] lower+0x7/0x20
>>
>> ---[ end trace ee4be53b94695793 ]---
>>
>> Syscall numbers are actually 90 and 192 - sys_old_mmap and sys_mmap2 on
>> m68k. Used the calculator on my Ubuntu desktop, that appears to be a little
>> confused about hex to decimal conversions.
>>
>> I hope that makes more sense?
>
> Not really; what is the condition you are checking? The interesting trace

The check in get_reg() is:

	if (WARN_ON_ONCE((off < PT_REG(d1)) &&
		test_ti_thread_status(task_thread_info(task),TIS_TRACING) &&
		!test_ti_thread_status(task_thread_info(task), TIS_ALLREGS_SAVED))) {
		unsigned long *addr_d0;

		addr_d0 = (unsigned long *)(task->thread.esp0 + regoff[16]);
		pr_err("get_reg with incomplete stack, regno %d offs %d orig_d0 %lx\n",
		       regno, off, *addr_d0);
		return 0;
	}

> is not that with get_reg() - it's that of the process being traced. You
> are not accessing the stack of caller of ptrace(2) here, so you want to
> know that SAVE_SWITCH_STACK had been done by the tracee, not tracer.
>
> And if that had been strace ls, you have TIF_SYSCALL_TRACE set for ls, so
> * ls hits system_call
> * notices TIF_SYSCALL_TRACE and goes to do_trace_entry
> * does SAVE_SWITCH_STACK there

... and sets both the new TIS_TRACING and TIS_ALLREGS_SAVED flags in the
thread_info->status field (now that I've corrected my patch).

> * calls syscall_trace(), which calls ptrace_notify()
> * ptrace_notify() calls ptrace_do_notify(), which calls ptrace_stop()
> * ptrace_stop() arranges for tracer to be woken up and gives CPU up,
> with TASK_TRACED as process state.

Thanks for explaining! So in order to get a trace for the process being
traced, I would have to check the TIS_ALLREGS_SAVED in ptrace_stop()?

> That's the callchain in ls, and switch_stack accessed by get_reg() from
> strace is the one on ls(1) stack created by SAVE_SWITCH_STACK.

So testing for TIS_ALLREGS_SAVED in get_reg() (called by the tracer, but
with the tracee's task struct passed to arch_ptrace()) does check that
SAVE_SWITCH_STACK was done before the syscall in the tracee, right?

Anyway, I'd missed setting the flags for some crucial SAVE_SWITCH_STACK
operations in my woefully incomplete patch. With that corrected, there's
no more warning from mmap. I'll try with a more recent version of strace
and gdb once I've updated my test image.

Cheers,

Michael

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 2:01 ` Michael Schmitz
  2021-06-21 2:17 ` Linus Torvalds
@ 2021-06-21 2:27 ` Al Viro
  2021-06-21 3:36 ` Michael Schmitz
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-21 2:27 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Mon, Jun 21, 2021 at 02:01:18PM +1200, Michael Schmitz wrote:
> Hi Eric,
>
> instrumenting get_reg on m68k and using a similar patch to yours to warn
> when unsaved registers are accessed on the switch stack, I get a hit from
> getegid and getegid32, just by running a simple ptrace on ls.
>
> Going to wack those two moles now ...

Explain, please. get_reg() is called by tracer; whose state are you checking?
Because you are *not* accessing the switch stack of the caller of get_reg().
And tracee should be in something like syscall_trace() or do_notify_resume();
both have SAVE_SWITCH_STACK done by the glue...

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack
  2021-06-21 2:27 ` Al Viro
@ 2021-06-21 3:36 ` Michael Schmitz
  0 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-21 3:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Hi Al,

On 21.06.2021 at 14:27, Al Viro wrote:
> On Mon, Jun 21, 2021 at 02:01:18PM +1200, Michael Schmitz wrote:
>> Hi Eric,
>>
>> instrumenting get_reg on m68k and using a similar patch to yours to warn
>> when unsaved registers are accessed on the switch stack, I get a hit from
>> getegid and getegid32, just by running a simple ptrace on ls.
>>
>> Going to wack those two moles now ...
>
> Explain, please. get_reg() is called by tracer; whose state are you checking?

The check is only triggered while syscall tracing is active (I set a
flag on trace entry, and clear that on trace exit)...

From the WARN_ONCE stack dump, it appears that I get the warning from
inside the syscall, not syscall_trace().

> Because you are *not* accessing the switch stack of the caller of get_reg().
> And tracee should be in something like syscall_trace() or do_notify_resume();
> both have SAVE_SWITCH_STACK done by the glue...

And that's where my problem may be - I stupidly forgot to set the 'all
registers saved' flag before calling syscall_trace() ...

I'll fix that and try again. Sorry for the noise!

Cheers,

Michael

>

^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 18:29 ` [PATCH 0/2] alpha/ptrace: Improved switch_stack handling Eric W. Biederman
  2021-06-16 18:31 ` [PATCH 1/2] alpha/ptrace: Record and handle the absence of switch_stack Eric W. Biederman
@ 2021-06-16 18:32 ` Eric W. Biederman
  2021-06-16 20:25 ` Al Viro
  1 sibling, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

With the introduction of PTRACE_EVENT_XXX flags during the 2.5
development cycle it became possible for ptrace to write arbitrary
data to alpha kernel stack frames because it was assumed that wherever
ptrace_stop was called both a pt_regs and a switch_stack were saved
upon the stack.

The introduction of TIF_ALLREGS_SAVED removed the assumption that
switch_stack was saved on the kernel thread by transforming the
problem into a lesser bug where the accesses simply don't work.

Saving struct switch_stack has to happen on the lowest level of the
stack on alpha because it contains callee-saved registers, which will
be saved by the C code in arbitrary locations on the stack if the data
is not saved immediately.

Update kernel threads to save a full set of userspace registers on the
stack so that io_uring threads can be ptraced.

Update fork, vfork, clone, exit, exit_group, execve, and execveat to
save all of the userspace registers when they are called as there are
known PTRACE_EVENT_XXX ptrace stop points in those functions where
registers can be manipulated.

The switch_stack frames serve double duty in fork, vfork, and clone,
as both the child's inputs to alpha_switch_to and the parent's saved
copy of the registers for debuggers to modify.
This change marks the frame as present in the parent, and clears
TIF_ALLREGS_SAVED in the child as alpha_switch_to will consume the
switch_stack when the child is started.

Cc: stable@vger.kernel.org
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread")
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/alpha/kernel/entry.S | 24 +++++++++++++++++-------
 arch/alpha/kernel/process.c | 3 +++
 arch/alpha/kernel/syscalls/syscall.tbl | 8 ++++----
 3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index c1edf54dc035..f29a40e2daf1 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -801,13 +801,18 @@ ret_from_fork:
 	.align 4
 	.globl ret_from_kernel_thread
 	.ent ret_from_kernel_thread
+	.cfi_startproc
 ret_from_kernel_thread:
 	mov $17, $16
 	jsr $26, schedule_tail
+	/* PF_IO_WORKER threads can be ptraced */
+	SAVE_SWITCH_STACK
 	mov $9, $27
 	mov $10, $16
 	jsr $26, ($9)
+	RESTORE_SWITCH_STACK
 	br $31, ret_to_user
+	.cfi_endproc
 .end ret_from_kernel_thread
 
 \f
@@ -816,23 +821,28 @@ ret_from_kernel_thread:
  * have to play switch_stack games.
  */
 
-.macro fork_like name
+.macro allregs name
 	.align 4
 	.globl alpha_\name
 	.ent alpha_\name
+	.cfi_startproc
 alpha_\name:
 	.prologue 0
-	bsr $1, do_switch_stack
+	SAVE_SWITCH_STACK
 	jsr $26, sys_\name
-	ldq $26, 56($sp)
-	lda $sp, SWITCH_STACK_SIZE($sp)
+	RESTORE_SWITCH_STACK
 	ret
+	.cfi_endproc
 .end alpha_\name
 .endm
 
-fork_like fork
-fork_like vfork
-fork_like clone
+allregs fork
+allregs vfork
+allregs clone
+allregs exit
+allregs exit_group
+allregs execve
+allregs execveat
 
 .macro sigreturn_like name
 	.align 4
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..3bf480468a89 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,6 +249,9 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1; /* set FEN, clear everything else */
 
+	/* In the child the registers are consumed by alpha_switch_to */
+	clear_ti_thread_flag(childti, TIF_ALLREGS_SAVED);
+
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
 		memset(childstack, 0,
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..5f85f3c11ed4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0 common osf_syscall alpha_syscall_zero
-1 common exit sys_exit
+1 common exit alpha_exit
 2 common fork alpha_fork
 3 common read sys_read
 4 common write sys_write
@@ -65,7 +65,7 @@
 56 common osf_revoke sys_ni_syscall
 57 common symlink sys_symlink
 58 common readlink sys_readlink
-59 common execve sys_execve
+59 common execve alpha_execve
 60 common umask sys_umask
 61 common chroot sys_chroot
 62 common osf_old_fstat sys_ni_syscall
@@ -333,7 +333,7 @@
 400 common io_getevents sys_io_getevents
 401 common io_submit sys_io_submit
 402 common io_cancel sys_io_cancel
-405 common exit_group sys_exit_group
+405
common exit_group alpha_exit_group
 406 common lookup_dcookie sys_lookup_dcookie
 407 common epoll_create sys_epoll_create
 408 common epoll_ctl sys_epoll_ctl
@@ -441,7 +441,7 @@
 510 common renameat2 sys_renameat2
 511 common getrandom sys_getrandom
 512 common memfd_create sys_memfd_create
-513 common execveat sys_execveat
+513 common execveat alpha_execveat
 514 common seccomp sys_seccomp
 515 common bpf sys_bpf
 516 common userfaultfd sys_userfaultfd
-- 
2.20.1

^ permalink raw reply related [flat|nested] 119+ messages in thread
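Editorial sketch, not part of the thread: the `.macro allregs name` in the patch above stamps out one assembly wrapper per syscall, each bracketing the real handler with SAVE_SWITCH_STACK and RESTORE_SWITCH_STACK. A C preprocessor analogue, with entirely hypothetical names (MODEL_ALLREGS, model_sys_exit, model_alpha_exit) and a handler that returns so the wrapper can be exercised (the real exit does not return), could look like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Stands in for TIF_ALLREGS_SAVED on the current thread. */
bool model_allregs_saved;

/* Records what the handler observed, so a caller can check it later. */
bool model_seen_by_handler;

/* Toy handlers; they only note whether the frame was marked as saved. */
long model_sys_exit(long code)
{
	model_seen_by_handler = model_allregs_saved;
	return code;
}

long model_sys_execve(long arg)
{
	model_seen_by_handler = model_allregs_saved;
	return arg;
}

/* One wrapper per syscall, mirroring "allregs exit", "allregs execve". */
#define MODEL_ALLREGS(name)                                        \
long model_alpha_##name(long arg)                                  \
{                                                                  \
	long ret;                                                  \
	model_allregs_saved = true;   /* SAVE_SWITCH_STACK */      \
	ret = model_sys_##name(arg);                               \
	model_allregs_saved = false;  /* RESTORE_SWITCH_STACK */   \
	return ret;                                                \
}

MODEL_ALLREGS(exit)
MODEL_ALLREGS(execve)
```

Pointing a syscall table entry at the wrapper rather than at the handler is then the equivalent of the syscall.tbl changes in the patch.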
* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 18:32 ` [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames Eric W. Biederman
@ 2021-06-16 20:25 ` Al Viro
  2021-06-16 20:28 ` Al Viro
  2021-06-16 20:47 ` Eric W. Biederman
  0 siblings, 2 replies; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:

> -.macro fork_like name
> +.macro allregs name
> 	.align 4
> 	.globl alpha_\name
> 	.ent alpha_\name
> +	.cfi_startproc
> alpha_\name:
> 	.prologue 0
> -	bsr $1, do_switch_stack
> +	SAVE_SWITCH_STACK
> 	jsr $26, sys_\name
> -	ldq $26, 56($sp)
> -	lda $sp, SWITCH_STACK_SIZE($sp)
> +	RESTORE_SWITCH_STACK

No. You've just added one hell of an overhead to fork(2),
for no reason whatsoever. sys_fork() et.al. does *NOT* modify the
callee-saved registers; it's plain C. So this change is complete
BS.

> +allregs exit
> +allregs exit_group

Details, please - what exactly makes exit(2) different from
e.g. open(2)?

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:25 ` Al Viro
@ 2021-06-16 20:28 ` Al Viro
  2021-06-16 20:49 ` Eric W. Biederman
  2021-06-16 20:47 ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Al Viro @ 2021-06-16 20:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Wed, Jun 16, 2021 at 08:25:35PM +0000, Al Viro wrote:
> On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:
>
> > -.macro fork_like name
> > +.macro allregs name
> > 	.align 4
> > 	.globl alpha_\name
> > 	.ent alpha_\name
> > +	.cfi_startproc
> > alpha_\name:
> > 	.prologue 0
> > -	bsr $1, do_switch_stack
> > +	SAVE_SWITCH_STACK
> > 	jsr $26, sys_\name
> > -	ldq $26, 56($sp)
> > -	lda $sp, SWITCH_STACK_SIZE($sp)
> > +	RESTORE_SWITCH_STACK
>
> No. You've just added one hell of an overhead to fork(2),
> for no reason whatsoever. sys_fork() et.al. does *NOT* modify the
> callee-saved registers; it's plain C. So this change is complete
> BS.
>
> > +allregs exit
> > +allregs exit_group
>
> Details, please - what exactly makes exit(2) different from
> e.g. open(2)?

Ah... PTRACE_EVENT_EXIT garbage, fortunately having no counterparts in case of
open(2)... Still, WTF would you want to restore callee-saved registers for
in case of exit(2)?

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames
  2021-06-16 20:28 ` Al Viro
@ 2021-06-16 20:49 ` Eric W. Biederman
  2021-06-16 20:54 ` Al Viro
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-16 20:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Jun 16, 2021 at 08:25:35PM +0000, Al Viro wrote:
>> On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote:
>>
>> > -.macro fork_like name
>> > +.macro allregs name
>> > 	.align 4
>> > 	.globl alpha_\name
>> > 	.ent alpha_\name
>> > +	.cfi_startproc
>> > alpha_\name:
>> > 	.prologue 0
>> > -	bsr $1, do_switch_stack
>> > +	SAVE_SWITCH_STACK
>> > 	jsr $26, sys_\name
>> > -	ldq $26, 56($sp)
>> > -	lda $sp, SWITCH_STACK_SIZE($sp)
>> > +	RESTORE_SWITCH_STACK
>>
>> No. You've just added one hell of an overhead to fork(2),
>> for no reason whatsoever. sys_fork() et.al. does *NOT* modify the
>> callee-saved registers; it's plain C. So this change is complete
>> BS.
>>
>> > +allregs exit
>> > +allregs exit_group
>>
>> Details, please - what exactly makes exit(2) different from
>> e.g. open(2)?
>
> Ah... PTRACE_EVENT_EXIT garbage, fortunately having no counterparts in case of
> open(2)... Still, WTF would you want to restore callee-saved registers for
> in case of exit(2)?

Someone might want or try to read them in the case of exit. Which
without some change will result in a read of other kernel stack content
on alpha.

Plus there are coredumps which definitely want to read everything.
Although admittedly that case no longer matters.

Eric

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames 2021-06-16 20:49 ` Eric W. Biederman @ 2021-06-16 20:54 ` Al Viro 0 siblings, 0 replies; 119+ messages in thread From: Al Viro @ 2021-06-16 20:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Wed, Jun 16, 2021 at 03:49:44PM -0500, Eric W. Biederman wrote: > Someone might want or try to read them in the case of exit. Which > without some change will result in a read of other kernel stack content > on alpha. And someone might want a pony. Again, why bother restoring those, _especially_ in case of exit(2)? > Plus there are coredumps which definitely want to read everything. Huh? In case of coredump we are going to have come through $work_notifysig: mov $sp, $16 DO_SWITCH_STACK jsr $26, do_work_pending so they *do* have full pt_regs saved. What's the problem? ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames 2021-06-16 20:25 ` Al Viro 2021-06-16 20:28 ` Al Viro @ 2021-06-16 20:47 ` Eric W. Biederman 2021-06-16 20:55 ` Al Viro 1 sibling, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-16 20:47 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Al Viro <viro@zeniv.linux.org.uk> writes: > On Wed, Jun 16, 2021 at 01:32:50PM -0500, Eric W. Biederman wrote: > >> -.macro fork_like name >> +.macro allregs name >> .align 4 >> .globl alpha_\name >> .ent alpha_\name >> + .cfi_startproc >> alpha_\name: >> .prologue 0 >> - bsr $1, do_switch_stack >> + SAVE_SWITCH_STACK >> jsr $26, sys_\name >> - ldq $26, 56($sp) >> - lda $sp, SWITCH_STACK_SIZE($sp) >> + RESTORE_SWITCH_STACK > > No. You've just added one hell of an overhead to fork(2), > for no reason whatsoever. sys_fork() et.al. does *NOT* modify the > callee-saved registers; it's plain C. So this change is complete > BS. Fork already saves the registers, all I did was restore them. Which makes a debugger that modifies them in PTRACE_EVENT_{FORK,VFORK,CLONE,VFORK_DONE} work. >> +allregs exit >> +allregs exit_group > > Details, please - what exactly makes exit(2) different from > e.g. open(2)? PTRACE_EVENT_EXIT. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames 2021-06-16 20:47 ` Eric W. Biederman @ 2021-06-16 20:55 ` Al Viro 0 siblings, 0 replies; 119+ messages in thread From: Al Viro @ 2021-06-16 20:55 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Wed, Jun 16, 2021 at 03:47:28PM -0500, Eric W. Biederman wrote: > Fork already saves the registers, all I did was restore them. Which > makes a debugger that modifies them in > PTRACE_EVENT_{FORK,VFORK,CLONE,VFORK_DONE} work. ... first time ever. Wonderful and well worth the overhead. </sarcasm> ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads 2021-06-15 19:36 ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman 2021-06-15 22:02 ` Linus Torvalds @ 2021-06-16 20:50 ` Al Viro 1 sibling, 0 replies; 119+ messages in thread From: Al Viro @ 2021-06-16 20:50 UTC (permalink / raw) To: Eric W. Biederman Cc: Michael Schmitz, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Tue, Jun 15, 2021 at 02:36:38PM -0500, Eric W. Biederman wrote: > > While thinking about the information leaks fixed in 77f6ab8b7768 > ("don't dump the threads that had been already exiting when zapped.") > I realized the problem is much more general than just coredumps and > exit_mm. We have io_uring threads, PTRACE_EVENT_EXEC and > PTRACE_EVENT_EXIT where ptrace is allowed to access userspace > registers, but on some architectures has not saved them. Wait a sec. To have anything happen on PTRACE_EVENT_EXEC, you need the fucker traced. *IF* you want to go that way, at least make it conditional upon the same thing. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-15 19:30 ` Eric W. Biederman 2021-06-15 19:36 ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman @ 2021-06-15 20:56 ` Michael Schmitz 2021-06-16 0:23 ` Finn Thain 2021-06-15 21:58 ` Linus Torvalds 2021-06-16 7:38 ` Geert Uytterhoeven 3 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-15 20:56 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, John Paul Adrian Glaubitz Hi Eric, On 16/06/21 7:30 am, Eric W. Biederman wrote: > >>> The io_uring tasks are special in that they are user process >>> threads that never run in userspace. So as long as everything >>> ptrace can read is accessible on that process all is well. >> OK, I'm testing a patch that would save extra context in sys_io_uring_setup, >> which ought to ensure that for m68k. > I had to update ret_from_kernel_thread to pop that state to get Linus's > change to boot. Apparently kernel_threads exiting needs to be handled. Hadn't yet got to that stage, sorry. Still stress testing stage 1 of my fix (push complete context). I would have thought that this should be sufficient (gives us a complete stack frame for ptrace code to work on)? But it makes sense that when you push an extra stack frame, you'd need to pop that on exit. > >>> Having stared a bit longer at the code I think the short term >>> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard >>> them both with CONFIG_HAVE_ARCH_TRACEHOOK. > Which does not work because nios2 which looks susceptible > sets CONFIG_HAVE_ARCH_TRACEHOOK. > > A further look shows that there is also PTRACE_EVENT_EXEC that > needs to be handled so execve and execveat need to be wrapped > as well. 
> > Do you happen to know if there is userspace that will run > in qemu-system-m68k that can be used for testing? I surmise so. I don't use qemu myself - either ARAnyM, or actual hardware. Hardware is limited to 14 MB RAM, which has prevented me from using more than simple regression testing. In particular, I can't test sys_io_uring_setup there. Adrian uses qemu a lot, and has supplied disk images to work from on occasion. Maybe he's got something recent enough to support sys_io_uring_setup ... I've CC:ed him in, as I'd love to do some more testing as well. Cheers, Michael > > Eric > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-15 20:56 ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz @ 2021-06-16 0:23 ` Finn Thain 0 siblings, 0 replies; 119+ messages in thread From: Finn Thain @ 2021-06-16 0:23 UTC (permalink / raw) To: Eric W. Biederman, Michael Schmitz Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, John Paul Adrian Glaubitz On Wed, 16 Jun 2021, Michael Schmitz wrote: > > > > Do you happen to know if there is userspace that will run in > > qemu-system-m68k that can be used for testing? > > I surmise so. I don't use qemu myself - either ARAnyM, or actual > hardware. Hardware is limited to 14 MB RAM, which has prevented me from > using more than simple regression testing. In particular, I can't test > sys_io_uring_setup there. > > Adrian uses qemu a lot, and has supplied disk images to work from on > occasion. Maybe he's got something recent enough to support > sys_io_uring_setup ... I've CC:ed him in, as I'd love to do some more > testing as well. > As well as Debian/m68k, there is also a Gentoo/m68k stage3 rootfs available here: https://sourceforge.net/projects/linux-mac68k/files/Gentoo%20m68k%20unauthorized/ I built that rootfs last year using Catalyst. Some background (including the qemu-system-m68k command-line) can be found here: https://forums.gentoo.org/viewtopic-t-1100780.html ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-15 19:30 ` Eric W. Biederman 2021-06-15 19:36 ` [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads Eric W. Biederman 2021-06-15 20:56 ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Michael Schmitz @ 2021-06-15 21:58 ` Linus Torvalds 2021-06-16 15:06 ` Eric W. Biederman 2021-06-21 13:54 ` Al Viro 2021-06-16 7:38 ` Geert Uytterhoeven 3 siblings, 2 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-15 21:58 UTC (permalink / raw) To: Eric W. Biederman Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Tue, Jun 15, 2021 at 12:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > I had to update ret_from_kernel_thread to pop that state to get Linus's > change to boot. Apparently kernel_threads exiting needs to be handled. You are very right. That, btw, seems to be a horrible design mistake, but I think it's how "kernel_execve()" works - both for the initial "init", but also for user-mode helper processes. Both of those cases do "kernel_thread()" to create a new thread, and then that new kernel thread does kernel_execve() to create the user space image for that thread. And that act of "execve()" clears PF_KTHREAD from the thread, and then the final return from the kernel thread function returns to that new user space. Or something like that. It's been ages since I looked at that code, and your patch initially confused the heck out of me because I went "that can't _possibly_ be needed". But yes, I think your patch is right. And I think our horrible "kernel threads return to user space when done" is absolutely horrifically nasty. Maybe of the clever sort, but mostly of the historical horror sort. Or am I misremembering how this ends up working? 
Did you look at exactly what it was that returned from kernel threads? This might be worth commenting on somewhere. But your patch for alpha looks correct to me. Did you have some test-case to verify ptrace() on io worker threads? Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-15 21:58 ` Linus Torvalds @ 2021-06-16 15:06 ` Eric W. Biederman 2021-06-21 13:54 ` Al Viro 1 sibling, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-16 15:06 UTC (permalink / raw) To: Linus Torvalds Cc: Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, Jun 15, 2021 at 12:32 PM Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> I had to update ret_from_kernel_thread to pop that state to get Linus's >> change to boot. Apparently kernel_threads exiting needs to be handled. > > You are very right. > > That, btw, seems to be a horrible design mistake, but I think it's how > "kernel_execve()" works - both for the initial "init", but also for > user-mode helper processes. > > Both of those cases do "kernel_thread()" to create a new thread, and > then that new kernel thread does kernel_execve() to create the user > space image for that thread. And that act of "execve()" clears > PF_KTHREAD from the thread, and then the final return from the kernel > thread function returns to that new user space. > > Or something like that. It's been ages since I looked at that code, > and your patch initially confused the heck out of me because I went > "that can't _possibly_ be needed". > > But yes, I think your patch is right. > > And I think our horrible "kernel threads return to user space when > done" is absolutely horrifically nasty. Maybe of the clever sort, but > mostly of the historical horror sort. > > Or am I mis-rememberting how this ends up working? Did you look at > exactly what it was that returned from kernel threads? > > This might be worth commenting on somewhere. But your patch for alpha > looks correct to me. 
Did you have some test-case to verify ptrace() on > io worker threads? At this point I just booted an alpha image on qemu-system-alpha. I do have gdb in my kernel image so I can test that. I haven't yet but I can and should. Sleeping on it, I came up with a plan to add TF_SWITCH_STACK_SAVED to indicate if the registers have been saved. The DO_SWITCH_STACK and UNDO_SWITCH_STACK helpers (except in alpha_switch_to) can test that. The ptrace helpers can test that and turn an access of random kernel stack contents into something well behaved that does WARN_ON_ONCE because we should not get there. I suspect adding TF_SWITCH_STACK_SAVED should come first so it is easy to verify the problem behavior, before I fix it. My real goal is to find a pattern that architectures whose register saves are structured like alpha's can emulate, to minimize problems in the future. Plus I would really like to get the last handful of architectures updated so we can remove CONFIG_HAVE_ARCH_TRACEHOOK. I think we can do that on alpha because we save all of the system call arguments in pt_regs and that is all the other non-ptrace code paths care about. AKA I am trying to move the old architectures forward so we can get rid of unnecessary complications in the core code. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
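The TF_SWITCH_STACK_SAVED plan described in the message above can be modelled in plain user-space C. This is only a sketch under stated assumptions, not kernel code: TF_SWITCH_STACK_SAVED is the flag name proposed in the message, but its bit value, the register count, `struct model_thread` and the `model_*()` helpers are hypothetical names made up for illustration. The real checks would presumably sit in the alpha entry code and ptrace helpers.

```c
#include <stdbool.h>
#include <stdio.h>

/* Flag name from the proposal; the bit value here is an assumption. */
#define TF_SWITCH_STACK_SAVED 0x1u

/* Assumed number of callee-saved registers, for illustration only. */
#define NR_CALLEE_SAVED 7

/* Hypothetical stand-in for the per-thread state the kernel would keep. */
struct model_thread {
	unsigned int flags;
	unsigned long callee_saved[NR_CALLEE_SAVED];
};

/* Models DO_SWITCH_STACK: store the registers and record that they are valid. */
static void model_save_switch_stack(struct model_thread *t,
				    const unsigned long *regs)
{
	for (int i = 0; i < NR_CALLEE_SAVED; i++)
		t->callee_saved[i] = regs[i];
	t->flags |= TF_SWITCH_STACK_SAVED;
}

/* Models UNDO_SWITCH_STACK: the saved frame is gone, so clear the flag. */
static void model_undo_switch_stack(struct model_thread *t)
{
	t->flags &= ~TF_SWITCH_STACK_SAVED;
}

/*
 * Models a ptrace register read. If the frame was never saved, report 0
 * and warn once (loosely mimicking WARN_ON_ONCE) instead of returning
 * whatever happens to be on the stack.
 */
static bool model_read_callee_saved(const struct model_thread *t, int idx,
				    unsigned long *val)
{
	static bool warned;

	*val = 0;
	if (idx < 0 || idx >= NR_CALLEE_SAVED)
		return false;
	if (!(t->flags & TF_SWITCH_STACK_SAVED)) {
		if (!warned) {
			warned = true;
			fprintf(stderr, "model: register read without saved frame\n");
		}
		return false;
	}
	*val = t->callee_saved[idx];
	return true;
}
```

In this model a read before `model_save_switch_stack()` yields 0 and a single warning, while a read after it yields the stored value, which is the "well behaved" outcome the message asks for.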
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-15 21:58 ` Linus Torvalds 2021-06-16 15:06 ` Eric W. Biederman @ 2021-06-21 13:54 ` Al Viro 2021-06-21 14:16 ` Al Viro ` (2 more replies) 1 sibling, 3 replies; 119+ messages in thread From: Al Viro @ 2021-06-21 13:54 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: > And I think our horrible "kernel threads return to user space when > done" is absolutely horrifically nasty. Maybe of the clever sort, but > mostly of the historical horror sort. How would you prefer to handle that, then? Separate magical path from kernel_execve() to switch to userland? We used to have something of that sort, and that had been a real horror... As it is, it's "kernel thread is spawned at the point similar to ret_from_fork(), runs the payload (which almost never returns) and then proceeds out to userland, same way fork(2) would've done." That way kernel_execve() doesn't have to do anything magical. Al, digging through the old notes and current call graph... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 13:54 ` Al Viro @ 2021-06-21 14:16 ` Al Viro 2021-06-21 16:50 ` Eric W. Biederman 2021-06-21 15:38 ` Linus Torvalds 2021-06-21 18:59 ` Al Viro 2 siblings, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-21 14:16 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote: > On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: > > > And I think our horrible "kernel threads return to user space when > > done" is absolutely horrifically nasty. Maybe of the clever sort, but > > mostly of the historical horror sort. > > How would you prefer to handle that, then? Separate magical path from > kernel_execve() to switch to userland? We used to have something of > that sort, and that had been a real horror... > > As it is, it's "kernel thread is spawned at the point similar to > ret_from_fork(), runs the payload (which almost never returns) and > then proceeds out to userland, same way fork(2) would've done." > That way kernel_execve() doesn't have to do anything magical. > > Al, digging through the old notes and current call graph... FWIW, the major assumption back then had been that get_signal(), signal_delivered() and all associated machinery (including coredumps) runs *only* from SIGPENDING/NOTIFY_SIGNAL handling. And "has complete registers on stack" is only a part of that; there was other fun stuff in the area ;-/ Do we want coredumps for those, and if we do, will the de_thread stuff work there? ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 14:16 ` Al Viro @ 2021-06-21 16:50 ` Eric W. Biederman 2021-06-21 23:05 ` Al Viro 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-21 16:50 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Al Viro <viro@zeniv.linux.org.uk> writes: > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote: >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: >> >> > And I think our horrible "kernel threads return to user space when >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but >> > mostly of the historical horror sort. >> >> How would you prefer to handle that, then? Separate magical path from >> kernel_execve() to switch to userland? We used to have something of >> that sort, and that had been a real horror... >> >> As it is, it's "kernel thread is spawned at the point similar to >> ret_from_fork(), runs the payload (which almost never returns) and >> then proceeds out to userland, same way fork(2) would've done." >> That way kernel_execve() doesn't have to do anything magical. >> >> Al, digging through the old notes and current call graph... > > FWIW, the major assumption back then had been that get_signal(), > signal_delivered() and all associated machinery (including coredumps) > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling. > > And "has complete registers on stack" is only a part of that; > there was other fun stuff in the area ;-/ Do we want coredumps for > those, and if we do, will the de_thread stuff work there? Do we want coredumps from processes that use io_uring? yes Exactly what we want from io_uring threads is less clear. 
We can't really give much that is meaningful beyond the thread ids of the io_uring threads. What problems are you seeing beyond the missing registers on the stack for kernel threads? I don't immediately see the connection between coredumps and de_thread. The function de_thread arranges for the fatal_signal_pending to be true, and that should work just fine for io_uring threads. The io_uring threads process the fatal_signal with get_signal and then proceed to exit, eventually calling do_exit. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 16:50 ` Eric W. Biederman @ 2021-06-21 23:05 ` Al Viro 2021-06-22 16:39 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-21 23:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote: > Al Viro <viro@zeniv.linux.org.uk> writes: > > > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote: > >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: > >> > >> > And I think our horrible "kernel threads return to user space when > >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but > >> > mostly of the historical horror sort. > >> > >> How would you prefer to handle that, then? Separate magical path from > >> kernel_execve() to switch to userland? We used to have something of > >> that sort, and that had been a real horror... > >> > >> As it is, it's "kernel thread is spawned at the point similar to > >> ret_from_fork(), runs the payload (which almost never returns) and > >> then proceeds out to userland, same way fork(2) would've done." > >> That way kernel_execve() doesn't have to do anything magical. > >> > >> Al, digging through the old notes and current call graph... > > > > FWIW, the major assumption back then had been that get_signal(), > > signal_delivered() and all associated machinery (including coredumps) > > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling. > > > > And "has complete registers on stack" is only a part of that; > > there was other fun stuff in the area ;-/ Do we want coredumps for > > those, and if we do, will the de_thread stuff work there? > > Do we want coredumps from processes that use io_uring? 
yes > Exactly what we want from io_uring threads is less clear. We can't > really give much that is meaningful beyond the thread ids of the > io_uring threads. > > What problems do are you seeing beyond the missing registers on the > stack for kernel threads? > > I don't immediately see the connection between coredumps and de_thread. > > The function de_thread arranges for the fatal_signal_pending to be true, > and that should work just fine for io_uring threads. The io_uring > threads process the fatal_signal with get_signal and then proceed to > exit eventually calling do_exit. I would like to see the testing in cases when the io-uring thread is the one getting hit by initial signal and when it's the normal one with associated io-uring ones. The thread-collecting logics at least used to depend upon fairly subtle assumptions, and "kernel threads obviously can't show up as candidates" used to narrow the analysis down... In any case, WTF would we allow reads or writes to *any* registers of such threads? It's not as simple as "just return zeroes", BTW - the values allowed in special registers might have non-trivial constraints on them. The same goes for coredump - we don't _have_ registers to dump for those, period. Looks like the first things to do would be * prohibit ptrace accessing any regsets of worker threads * make coredump skip all register notes for those Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into ptrace_notify() - explicit CLONE_UNTRACED for the former and zero current->ptrace in the caller of the latter. So fork and exec side has ptrace_event() crap limited to real syscalls. It's seccomp[1] and exit-related stuff that are messy... [1] "never trust somebody who introduces himself as Honest Joe and keeps carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY, etc. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:05 ` Al Viro @ 2021-06-22 16:39 ` Eric W. Biederman 0 siblings, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-22 16:39 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Al Viro <viro@zeniv.linux.org.uk> writes: > On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote: >> Al Viro <viro@zeniv.linux.org.uk> writes: >> >> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote: >> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: >> >> >> >> > And I think our horrible "kernel threads return to user space when >> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but >> >> > mostly of the historical horror sort. >> >> >> >> How would you prefer to handle that, then? Separate magical path from >> >> kernel_execve() to switch to userland? We used to have something of >> >> that sort, and that had been a real horror... >> >> >> >> As it is, it's "kernel thread is spawned at the point similar to >> >> ret_from_fork(), runs the payload (which almost never returns) and >> >> then proceeds out to userland, same way fork(2) would've done." >> >> That way kernel_execve() doesn't have to do anything magical. >> >> >> >> Al, digging through the old notes and current call graph... >> > >> > FWIW, the major assumption back then had been that get_signal(), >> > signal_delivered() and all associated machinery (including coredumps) >> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling. >> > >> > And "has complete registers on stack" is only a part of that; >> > there was other fun stuff in the area ;-/ Do we want coredumps for >> > those, and if we do, will the de_thread stuff work there? 
>> >> Do we want coredumps from processes that use io_uring? yes >> Exactly what we want from io_uring threads is less clear. We can't >> really give much that is meaningful beyond the thread ids of the >> io_uring threads. >> >> What problems do are you seeing beyond the missing registers on the >> stack for kernel threads? >> >> I don't immediately see the connection between coredumps and de_thread. >> >> The function de_thread arranges for the fatal_signal_pending to be true, >> and that should work just fine for io_uring threads. The io_uring >> threads process the fatal_signal with get_signal and then proceed to >> exit eventually calling do_exit. > > I would like to see the testing in cases when the io-uring thread is > the one getting hit by initial signal and when it's the normal one > with associated io-uring ones. The thread-collecting logics at least > used to depend upon fairly subtle assumptions, and "kernel threads > obviously can't show up as candidates" used to narrow the analysis > down... > > In any case, WTF would we allow reads or writes to *any* registers of > such threads? It's not as simple as "just return zeroes", BTW - the > values allowed in special registers might have non-trivial constraints > on them. The same goes for coredump - we don't _have_ registers to > dump for those, period. > > Looks like the first things to do would be > * prohibit ptrace accessing any regsets of worker threads > * make coredump skip all register notes for those Skipping register notes is fine. Prohibiting ptrace access to any regsets of worker threads is interesting. I think that was tried and shown to confuse gdb. So the conclusion was just to provide a fake set of registers. Which has appears to work up to the point of dealing with architectures that have their magic caller-saved optimization (like alpha and m68k), and no check that all of the registers were saved when accessed. 
Adding a dummy switch stack frame for the kernel threads on those architectures looks like a good/cheap solution at first glance. > Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into > ptrace_notify() - explicit CLONE_UNTRACED for the former and zero > current->ptrace in the caller of the latter. So fork and exec side > has ptrace_event() crap limited to real syscalls. That is where I thought we were. Thanks for confirming that. > It's seccomp[1] and exit-related stuff that are messy... > > [1] "never trust somebody who introduces himself as Honest Joe and keeps > carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY, > etc. ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 13:54 ` Al Viro 2021-06-21 14:16 ` Al Viro @ 2021-06-21 15:38 ` Linus Torvalds 2021-06-21 18:59 ` Al Viro 2 siblings, 0 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-21 15:38 UTC (permalink / raw) To: Al Viro Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 6:55 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: > > > And I think our horrible "kernel threads return to user space when > > done" is absolutely horrifically nasty. Maybe of the clever sort, but > > mostly of the historical horror sort. > > How would you prefer to handle that, then? Separate magical path from > kernel_execve() to switch to userland? We used to have something of > that sort, and that had been a real horror... Hmm. Maybe the alternatives would all be worse. The current thing is clever, and shares the return path with the normal case. It's just also a bit surprising, in that a kernel thread normally must not return - with the magical exception of "if it had done a kernel_execve() at some point, then returning is magically the way you actually start user mode". So it all feels very special, and there's not even a comment about it. I think we only have two users of that thing (the very first 'init', and user-mode-helper), so I guess it doesn't really matter. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 13:54 ` Al Viro 2021-06-21 14:16 ` Al Viro 2021-06-21 15:38 ` Linus Torvalds @ 2021-06-21 18:59 ` Al Viro 2021-06-21 19:22 ` Linus Torvalds 2021-06-21 19:24 ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro 2 siblings, 2 replies; 119+ messages in thread From: Al Viro @ 2021-06-21 18:59 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote: > On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: > > > And I think our horrible "kernel threads return to user space when > > done" is absolutely horrifically nasty. Maybe of the clever sort, but > > mostly of the historical horror sort. > > How would you prefer to handle that, then? Separate magical path from > kernel_execve() to switch to userland? We used to have something of > that sort, and that had been a real horror... > > As it is, it's "kernel thread is spawned at the point similar to > ret_from_fork(), runs the payload (which almost never returns) and > then proceeds out to userland, same way fork(2) would've done." > That way kernel_execve() doesn't have to do anything magical. > > Al, digging through the old notes and current call graph... There's a large mess around do_exit() - we have a bunch of callers all over arch/*; if nothing else, I very much doubt that we really want to let tracer play with a thread in the middle of die_if_kernel() or similar. We sure as hell do not want to arrange for anything on the kernel stack in such situations, no matter what's done in exit(2)... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 18:59 ` Al Viro @ 2021-06-21 19:22 ` Linus Torvalds 2021-06-21 19:45 ` Al Viro 2021-06-21 20:03 ` Eric W. Biederman 2021-06-21 19:24 ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro 1 sibling, 2 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-21 19:22 UTC (permalink / raw) To: Al Viro Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 11:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > There's a large mess around do_exit() - we have a bunch of > callers all over arch/*; if nothing else, I very much doubt that really > want to let tracer play with a thread in the middle of die_if_kernel() > or similar. Right you are. I'm really beginning to hate ptrace_{event,notify}() and those PTRACE_EVENT_xyz things. I don't even know what uses them, honestly. How very annoying. I guess it's easy enough (famous last words) to move the ptrace_event() call out of do_exit() and into the actual exit/exit_group system calls, and the signal handling path. The paths that actually have proper pt_regs. Looks like sys_exit() and do_group_exit() would be the two places to do it (do_group_exit() would handle the signal case and sys_group_exit()). Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 19:22 ` Linus Torvalds @ 2021-06-21 19:45 ` Al Viro 2021-06-21 23:14 ` Linus Torvalds 2021-06-21 20:03 ` Eric W. Biederman 1 sibling, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-21 19:45 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa On Mon, Jun 21, 2021 at 12:22:06PM -0700, Linus Torvalds wrote: > On Mon, Jun 21, 2021 at 11:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > There's a large mess around do_exit() - we have a bunch of > > callers all over arch/*; if nothing else, I very much doubt that really > > want to let tracer play with a thread in the middle of die_if_kernel() > > or similar. > > Right you are. > > I'm really beginning to hate ptrace_{event,notify}() and those > PTRACE_EVENT_xyz things. > > I don't even know what uses them, honestly. How very annoying. > > I guess it's easy enough (famous last words) to move the > ptrace_event() call out of do_exit() and into the actual > exit/exit_group system calls, and the signal handling path. The paths > that actually have proper pt_regs. > > Looks like sys_exit() and do_group_exit() would be the two places to > do it (do_group_exit() would handle the signal case and > sys_group_exit()). Maybe... I'm digging through that pile right now, will follow up when I get a reasonably complete picture. In the meanwhile, do kernel/kthread.c uses look even remotely sane? Intentional - sure, but it really looks wrong to use thread exit code as communication channel there... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 19:45 ` Al Viro @ 2021-06-21 23:14 ` Linus Torvalds 2021-06-21 23:23 ` Al Viro ` (2 more replies) 0 siblings, 3 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-21 23:14 UTC (permalink / raw) To: Al Viro Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Looks like sys_exit() and do_group_exit() would be the two places to > > do it (do_group_exit() would handle the signal case and > > sys_group_exit()). > > Maybe... I'm digging through that pile right now, will follow up when > I get a reasonably complete picture We might have another possible way to solve this: (a) make it the rule that everybody always saves the full (integer) register set in pt_regs (b) make m68k just always create that switch-stack for all system calls (it's really not that big, I think it's like six words or something) (c) admit that alpha is broken, but nobody really cares > In the meanwhile, do kernel/kthread.c uses look even remotely sane? > Intentional - sure, but it really looks wrong to use thread exit code > as communication channel there... I really doubt that it is even "intentional". I think it's "use some errno as a random exit code" and nobody ever really thought about it, or thought about how that doesn't really work. People are used to the error numbers, not thinking about how do_exit() doesn't take an error number, but a signal number (and an 8-bit positive error code in bits 8-15). Because no, it's not even remotely sane. I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make more sense. And the -ENOMEM might be SIGBUS, perhaps. 
It does look like the usermode-helper code does save the exit code with things like kernel_wait(pid, &sub_info->retval); and I see call_usermodehelper_exec() doing retval = sub_info->retval; and treating it as an error code. But I think those have never been tested with that (bogus) exit code thing from kernel_wait(), because it wouldn't have worked. It has only ever been tested with the (real) exit code things like if (pid < 0) { sub_info->retval = pid; which does actually assign a negative error code to it. So I think that kernel_wait(pid, &sub_info->retval); line is buggy, and should be something like int wstatus; kernel_wait(pid, &wstatus); sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0; or something. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
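To make the encoding point above concrete, here is a minimal userspace sketch (not kernel code) of what happens when a negative errno is handed to something that expects a wait status. The helper names `make_exit_status()`, `raw_errno_status()` and `wstatus_to_retval()` are hypothetical and exist only for this illustration; the last one is a slightly stricter variant of the `WEXITSTATUS(wstatus) ? -EINVAL : 0` idea, in that it also treats "did not exit normally" as a failure.

```c
#include <assert.h>
#include <errno.h>
#include <sys/wait.h>

/* Wait status as reported for a normal exit(code): the exit code
 * lives in bits 8-15, and the low seven bits (signal number) are 0. */
static int make_exit_status(int code)
{
	assert(code >= 0 && code <= 0xff);
	return code << 8;
}

/* What a caller sees if -EINTR is used directly as the exit code,
 * as in the do_exit(-EINTR) pattern discussed above. */
static int raw_errno_status(void)
{
	return -EINTR;
}

/* Illustrative policy (an assumption, not the kernel's code):
 * 0 for a clean exit with code 0, -EINVAL for everything else. */
static int wstatus_to_retval(int wstatus)
{
	if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0)
		return 0;
	return -EINVAL;
}
```

With these helpers, `raw_errno_status()` does not decode as a normal exit at all: its low seven bits are nonzero, so `WIFEXITED()` is false and the value cannot be read back as a plain error number.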
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:14 ` Linus Torvalds @ 2021-06-21 23:23 ` Al Viro 2021-06-21 23:36 ` Linus Torvalds 2021-06-22 0:01 ` Michael Schmitz 2021-06-22 20:04 ` Michael Schmitz 2 siblings, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-21 23:23 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa On Mon, Jun 21, 2021 at 04:14:36PM -0700, Linus Torvalds wrote: > On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > Looks like sys_exit() and do_group_exit() would be the two places to > > > do it (do_group_exit() would handle the signal case and > > > sys_group_exit()). > > > > Maybe... I'm digging through that pile right now, will follow up when > > I get a reasonably complete picture > > We might have another possible way to solve this: > > (a) make it the rule that everybody always saves the full (integer) > register set in pt_regs > > (b) make m68k just always create that switch-stack for all system > calls (it's really not that big, I think it's like six words or > something) > > (c) admit that alpha is broken, but nobody really cares How would it help e.g. oopsen on the way out of timer interrupts? IMO we simply shouldn't allow ptrace access if the tracee is in that kind of state, on any architecture... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:23 ` Al Viro @ 2021-06-21 23:36 ` Linus Torvalds 2021-06-22 21:02 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-21 23:36 UTC (permalink / raw) To: Al Viro Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa On Mon, Jun 21, 2021 at 4:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > How would it help e.g. oopsen on the way out of timer interrupts? > IMO we simply shouldn't allow ptrace access if the tracee is in that kind > of state, on any architecture... Yeah no, we can't do the "wait for ptrace" when the exit is due to an oops. Although honestly, we have other cases like that where do_exit() isn't 100% robust if you kill something in an interrupt. Like all the locks it leaves locked etc. So do_exit() from a timer interrupt is going to cause problems regardless. I agree it's probably a good idea to try to avoid causing even more with the odd ptrace thing, but I don't think ptrace_event is some really "fundamental" problem at that point - it's just one detail among many many. So I was more thinking of the debug patch for m68k to catch all the _regular_ cases, and all the other random cases of ptrace_event() or ptrace_notify(). Although maybe we've really caught them all. The exit case was clearly missing, and the thread fork case was scrogged. There are patches for the known problems. The patches I really don't like are the verification ones to find any unknown ones.. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:36 ` Linus Torvalds @ 2021-06-22 21:02 ` Eric W. Biederman 2021-06-22 21:48 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-22 21:02 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa Linus Torvalds <torvalds@linux-foundation.org> writes: > On Mon, Jun 21, 2021 at 4:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> How would it help e.g. oopsen on the way out of timer interrupts? >> IMO we simply shouldn't allow ptrace access if the tracee is in that kind >> of state, on any architecture... > > Yeah no, we can't do the "wait for ptrace" when the exit is due to an > oops. Although honestly, we have other cases like that where do_exit() > isn't 100% robust if you kill something in an interrupt. Like all the > locks it leaves locked etc. > > So do_exit() from a timer interrupt is going to cause problems > regardless. I agree it's probably a good idea to try to avoid causing > even more with the odd ptrace thing, but I don't think ptrace_event is > some really "fundamental" problem at that point - it's just one detail > among many many. > > So I was more thinking of the debug patch for m68k to catch all the > _regular_ cases, and all the other random cases of ptrace_event() or > ptrace_notify(). > > Although maybe we've really caught them all. The exit case was clearly > missing, and the thread fork case was scrogged. There are patches for > the known problems. The patches I really don't like are the > verification ones to find any unknown ones.. We still have nios2 which copied the m68k logic at some point. 
I think that is a processor that is still ``shipping'' and that people might still be using in new designs. I haven't looked closely enough to see what the other architectures with caller saved registers are doing. The challenging ones are /proc/pid/syscall and seccomp which want to see all of the system call arguments. I think every architecture always saves the system call arguments unconditionally, so those cases are probably not as interesting. But they certainly look like they could be trouble. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-22 21:02 ` Eric W. Biederman @ 2021-06-22 21:48 ` Michael Schmitz 2021-06-23 5:26 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-22 21:48 UTC (permalink / raw) To: Eric W. Biederman, Linus Torvalds Cc: Al Viro, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa, John Paul Adrian Glaubitz Hi Eric, On 23/06/21 9:02 am, Eric W. Biederman wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > So I was more thinking of the debug patch for m68k to catch all the > _regular_ cases, and all the other random cases of ptrace_event() or > ptrace_notify(). > > Although maybe we've really caught them all. The exit case was clearly > missing, and the thread fork case was scrogged. There are patches for > the known problems. The patches I really don't like are the > verification ones to find any unknown ones.. > We still have nios2 which copied the m68k logic at some point. I think > that is a processor that is still ``shipping'' and that people might > still be using in new designs. > > I haven't looked closely enough to see what the other architectures with > caller saved registers are doing. > > The challenging ones are /proc/pid/syscall and seccomp which want to see > all of the system call arguments. I think every architecture always > saves the system call arguments unconditionally, so those cases are > probably not as interesting. But they certain look like they could be > trouble. Seccomp hasn't yet been implemented on m68k, though I'm working on that with Adrian. The sole secure_computing() call will happen in syscall_trace_enter(), so all system call arguments have been saved on the stack. Haven't looked at /proc/pid/syscall yet ... 
Cheers, Michael > Eric > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-22 21:48 ` Michael Schmitz @ 2021-06-23 5:26 ` Michael Schmitz 2021-06-23 14:36 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-23 5:26 UTC (permalink / raw) To: Eric W. Biederman, Linus Torvalds Cc: Al Viro, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa, John Paul Adrian Glaubitz Hi Eric, Am 23.06.2021 um 09:48 schrieb Michael Schmitz: >> >> The challenging ones are /proc/pid/syscall and seccomp which want to see >> all of the system call arguments. I think every architecture always >> saves the system call arguments unconditionally, so those cases are >> probably not as interesting. But they certain look like they could be >> trouble. > > Seccomp hasn't yet been implemented on m68k, though I'm working on that > with Adrian. The sole secure_computing() call will happen in > syscall_trace_enter(), so all system call arguments have been saved on > the stack. > > Haven't looked at /proc/pid/syscall yet ... Not supported at present (no HAVE_ARCH_TRACEHOOK for m68k). And the syscall_get_arguments I wrote for seccomp support only copies the first five data registers, which are always saved. Cheers, Michael ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-23 5:26 ` Michael Schmitz @ 2021-06-23 14:36 ` Eric W. Biederman 0 siblings, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-23 14:36 UTC (permalink / raw) To: Michael Schmitz Cc: Linus Torvalds, Al Viro, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa, John Paul Adrian Glaubitz Michael Schmitz <schmitzmic@gmail.com> writes: > Hi Eric, > > Am 23.06.2021 um 09:48 schrieb Michael Schmitz: >>> >>> The challenging ones are /proc/pid/syscall and seccomp which want to see >>> all of the system call arguments. I think every architecture always >>> saves the system call arguments unconditionally, so those cases are >>> probably not as interesting. But they certain look like they could be >>> trouble. >> >> Seccomp hasn't yet been implemented on m68k, though I'm working on that >> with Adrian. The sole secure_computing() call will happen in >> syscall_trace_enter(), so all system call arguments have been saved on >> the stack. >> >> Haven't looked at /proc/pid/syscall yet ... > > Not supported at present (no HAVE_ARCH_TRACEHOOK for m68k). And the > syscall_get_arguments I wrote for seccomp support only copies the first five > data registers, which are always saved. Yes. It is looking like I can fix everything generically except for faking user space registers for io_uring threads. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:14 ` Linus Torvalds 2021-06-21 23:23 ` Al Viro @ 2021-06-22 0:01 ` Michael Schmitz 2021-06-22 20:04 ` Michael Schmitz 2 siblings, 0 replies; 119+ messages in thread From: Michael Schmitz @ 2021-06-22 0:01 UTC (permalink / raw) To: Linus Torvalds, Al Viro Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook, Tetsuo Handa Hi Linus, On 22/06/21 11:14 am, Linus Torvalds wrote: > On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote: >>> Looks like sys_exit() and do_group_exit() would be the two places to >>> do it (do_group_exit() would handle the signal case and >>> sys_group_exit()). >> Maybe... I'm digging through that pile right now, will follow up when >> I get a reasonably complete picture > We might have another possible way to solve this: > > (a) make it the rule that everybody always saves the full (integer) > register set in pt_regs > > (b) make m68k just always create that switch-stack for all system > calls (it's really not that big, I think it's like six words or > something) Correct - six words for registers, one for the return address. Probably still a win compared to setting and clearing flag bits all over the place in an attempt to catch any as yet undetected unsafe cases of ptrace_stop. I'll have to see how much of a performance impact I can see (not that I can even remotely measure that accurately - it's more of a 'does it now feel real sluggish' thing). Cheers, Michael > > (c) admit that alpha is broken, but nobody really cares > >> In the meanwhile, do kernel/kthread.c uses look even remotely sane? >> Intentional - sure, but it really looks wrong to use thread exit code >> as communication channel there... > I really doubt that it is even "intentional". 
> > I think it's "use some errno as a random exit code" and nobody ever > really thought about it, or thought about how that doesn't really > work. People are used to the error numbers, not thinking about how > do_exit() doesn't take an error number, but a signal number (and an > 8-bit positive error code in bits 8-15). > > Because no, it's not even remotely sane. > > I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make > more sense. And the -ENOMEM might be SIGBUS, perhaps. > > It does look like the usermode-helper code does save the exit code > with things like > > kernel_wait(pid, &sub_info->retval); > > and I see call_usermodehelper_exec() doing > > retval = sub_info->retval; > > and treating it as an error code. But I think those have never been > tested with that (bogus) exit code thing from kernel_wait(), because > it wouldn't have worked. It has only ever been tested with the (real) > exit code things like > > if (pid < 0) { > sub_info->retval = pid; > > which does actually assign a negative error code to it. > > So I think that > > kernel_wait(pid, &sub_info->retval); > > line is buggy, and should be something like > > int wstatus; > kernel_wait(pid, &wstatus); > sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0; > > or something. > > Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:14 ` Linus Torvalds 2021-06-21 23:23 ` Al Viro 2021-06-22 0:01 ` Michael Schmitz @ 2021-06-22 20:04 ` Michael Schmitz 2021-06-22 20:18 ` Al Viro 2 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-22 20:04 UTC (permalink / raw) To: Linus Torvalds, Al Viro Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa, Andreas Schwab Hi Linus, On 22/06/21 11:14 am, Linus Torvalds wrote: > On Mon, Jun 21, 2021 at 12:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote: >>> Looks like sys_exit() and do_group_exit() would be the two places to >>> do it (do_group_exit() would handle the signal case and >>> sys_group_exit()). >> Maybe... I'm digging through that pile right now, will follow up when >> I get a reasonably complete picture > We might have another possible way to solve this: > > (a) make it the rule that everybody always saves the full (integer) > register set in pt_regs > > (b) make m68k just always create that switch-stack for all system > calls (it's really not that big, I think it's like six words or > something) Turns out that is harder than it looked at first glance (at least for me). All syscalls that _do_ save the switch stack are currently called through wrappers which pull the syscall arguments out of the saved pt_regs on the stack (pushing the switch stack after the SAVE_ALL saved stuff buries the syscall arguments on the stack, see comment about m68k_clone()). We'd have to push the switch stack _first_ when entering system_call to leave the syscall arguments in place, but that will require further changes to the syscall exit path (currently shared with the interrupt exit path). 
Not to mention the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a few other dependencies that don't come to mind immediately. We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering of the two is only mentioned in a comment. Can we reorder them on the stack, as long as we don't change the struct definitions proper? This will take a little more time to work out and test - certainly not before the weekend. I'll send a corrected version of my debug patch before that. Cheers, Michael > > (c) admit that alpha is broken, but nobody really cares > >> In the meanwhile, do kernel/kthread.c uses look even remotely sane? >> Intentional - sure, but it really looks wrong to use thread exit code >> as communication channel there... > I really doubt that it is even "intentional". > > I think it's "use some errno as a random exit code" and nobody ever > really thought about it, or thought about how that doesn't really > work. People are used to the error numbers, not thinking about how > do_exit() doesn't take an error number, but a signal number (and an > 8-bit positive error code in bits 8-15). > > Because no, it's not even remotely sane. > > I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make > more sense. And the -ENOMEM might be SIGBUS, perhaps. > > It does look like the usermode-helper code does save the exit code > with things like > > kernel_wait(pid, &sub_info->retval); > > and I see call_usermodehelper_exec() doing > > retval = sub_info->retval; > > and treating it as an error code. But I think those have never been > tested with that (bogus) exit code thing from kernel_wait(), because > it wouldn't have worked. It has only ever been tested with the (real) > exit code things like > > if (pid < 0) { > sub_info->retval = pid; > > which does actually assign a negative error code to it. 
> > So I think that > > kernel_wait(pid, &sub_info->retval); > > line is buggy, and should be something like > > int wstatus; > kernel_wait(pid, &wstatus); > sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0; > > or something. > > Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-22 20:04 ` Michael Schmitz @ 2021-06-22 20:18 ` Al Viro 2021-06-22 21:57 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-22 20:18 UTC (permalink / raw) To: Michael Schmitz Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa, Andreas Schwab On Wed, Jun 23, 2021 at 08:04:11AM +1200, Michael Schmitz wrote: > All syscalls that _do_ save the switch stack are currently called through > wrappers which pull the syscall arguments out of the saved pt_regs on the > stack (pushing the switch stack after the SAVE_ALL saved stuff buries the > syscall arguments on the stack, see comment about m68k_clone(). We'd have to > push the switch stack _first_ when entering system_call to leave the syscall > arguments in place, but that will require further changes to the syscall > exit path (currently shared with the interrupt exit path). Not to mention > the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a > few other dependencies that don't come to mind immediately. > > We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering > of the two is only mentioned in a comment. Can we reorder them on the stack, > as long as we don't change the struct definitions proper? > > This will take a little more time to work out and test - certainly not > before the weekend. I'll send a corrected version of my debug patch before > that. This is insane, *especially* on m68k where you have the mess with different frame layouts and associated ->stkadj crap (see mangle_kernel_stack() for the (very) full barfbag). ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-22 20:18 ` Al Viro @ 2021-06-22 21:57 ` Michael Schmitz 0 siblings, 0 replies; 119+ messages in thread From: Michael Schmitz @ 2021-06-22 21:57 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook, Tetsuo Handa, Andreas Schwab Hi Al, On 23/06/21 8:18 am, Al Viro wrote: > On Wed, Jun 23, 2021 at 08:04:11AM +1200, Michael Schmitz wrote: > >> All syscalls that _do_ save the switch stack are currently called through >> wrappers which pull the syscall arguments out of the saved pt_regs on the >> stack (pushing the switch stack after the SAVE_ALL saved stuff buries the >> syscall arguments on the stack, see comment about m68k_clone(). We'd have to >> push the switch stack _first_ when entering system_call to leave the syscall >> arguments in place, but that will require further changes to the syscall >> exit path (currently shared with the interrupt exit path). Not to mention >> the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a >> few other dependencies that don't come to mind immediately. >> >> We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering >> of the two is only mentioned in a comment. Can we reorder them on the stack, >> as long as we don't change the struct definitions proper? >> >> This will take a little more time to work out and test - certainly not >> before the weekend. I'll send a corrected version of my debug patch before >> that. > This is insane, *especially* on m68k where you have the mess with different > frame layouts and associated ->stkadj crap (see mangle_kernel_stack() for > the (very) full barfbag). Indeed - that's one of the uses of pt_regs and switch_stack that I hadn't yet seen. 
So it's either leave the stack layout in system calls unchanged (aside from the ones that need the extra registers) and protect against accidental misuse of registers that weren't saved, with the overhead of playing with thread_info->status bits, or tackle the mess of redoing the stack layout to save all registers, always (did I already mention that I'd need a _lot_ of help from someone more conversant with m68k assembly coding for that option?). Which one of these two barf bags is the fuller one? Cheers, Michael ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 19:22 ` Linus Torvalds 2021-06-21 19:45 ` Al Viro @ 2021-06-21 20:03 ` Eric W. Biederman 2021-06-21 23:15 ` Linus Torvalds 1 sibling, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-21 20:03 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Mon, Jun 21, 2021 at 11:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> There's a large mess around do_exit() - we have a bunch of >> callers all over arch/*; if nothing else, I very much doubt that we really >> want to let tracer play with a thread in the middle of die_if_kernel() >> or similar. > > Right you are. > > I'm really beginning to hate ptrace_{event,notify}() and those > PTRACE_EVENT_xyz things. > > I don't even know what uses them, honestly. How very annoying. Modern strace does. Modern gdb appears not to. However strace at least does not read the exit code, or really appear to care about stopping for PTRACE_EVENT_EXIT. I completely agree with you that they are very annoying. > I guess it's easy enough (famous last words) to move the > ptrace_event() call out of do_exit() and into the actual > exit/exit_group system calls, and the signal handling path. The paths > that actually have proper pt_regs. > > Looks like sys_exit() and do_group_exit() would be the two places to > do it (do_group_exit() would handle the signal case and > sys_group_exit()). For other ptrace_event calls I am playing with seeing if I can split them in two, like sending a signal, so that we can perform all of the work in get_signal. 
I think we can even change exit_group(2) and exit(2) so that (at least when ptraced) they just send the ``event signal'' and then the signal handling path handles all of the ptrace stuff. When I started, I thought this was just going to be about exit, PTRACE_EVENT_EXIT, and some old architectures, and that a generic solution was going to be hard. I still think we are going to need to fix the io_uring threads on the architectures that use the caller saved register optimization like alpha and m68k. But I think we can handle the rest in generic code. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
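For readers wondering what a tracer such as strace actually sees for these PTRACE_EVENT_xyz stops, the following is a small userspace sketch of decoding the wait status, based on the convention documented in ptrace(2) that `status >> 8` equals `SIGTRAP | (PTRACE_EVENT_foo << 8)`. The names `EVENT_EXIT`, `make_event_stop_status()` and `ptrace_event_of()` are hypothetical helpers for illustration only; `EVENT_EXIT` mirrors the value of `PTRACE_EVENT_EXIT` so that the snippet does not depend on a particular ptrace header.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

#define EVENT_EXIT 6	/* same value as PTRACE_EVENT_EXIT */

/* Build the wait status a tracer would observe for a ptrace event
 * stop: 0x7f in the low byte marks "stopped", the next byte is the
 * stop signal (SIGTRAP), and the event number sits above that. */
static int make_event_stop_status(int event)
{
	assert(event > 0 && event <= 0xff);
	return (((event << 8) | SIGTRAP) << 8) | 0x7f;
}

/* Return the ptrace event number carried by a wait status, or 0 if
 * the status is not a SIGTRAP stop or carries no event. */
static int ptrace_event_of(int wstatus)
{
	if (!WIFSTOPPED(wstatus) || WSTOPSIG(wstatus) != SIGTRAP)
		return 0;
	return (wstatus >> 16) & 0xff;
}
```

A tracer loop would typically call something like `ptrace_event_of()` on the status returned by waitpid(2) and only then decide whether to read the tracee's registers, which is exactly the access this thread is worried about on alpha, m68k and nios2.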
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 20:03 ` Eric W. Biederman @ 2021-06-21 23:15 ` Linus Torvalds 2021-06-22 20:52 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-21 23:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 1:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > For other ptrace_event calls I am playing with seeing if I can split > them in two. Like sending a signal. So that we can have perform all > of the work in get_signal. That sounds like the right model, but I don't think it works. Particularly not for exit(). The second phase will never happen. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 23:15 ` Linus Torvalds @ 2021-06-22 20:52 ` Eric W. Biederman 2021-06-23 0:41 ` Linus Torvalds 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-22 20:52 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Mon, Jun 21, 2021 at 1:04 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> For other ptrace_event calls I am playing with seeing if I can split >> them in two. Like sending a signal. So that we can have perform all >> of the work in get_signal. > > That sounds like the right model, but I don't think it works. > Particularly not for exit(). The second phase will never happen. Playing with it some more I think I have everything working except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and exit_group(2). Basically in exit sending yourself a signal and then calling do_exit from the signal handler is not unreasonable, as exit is an ordinary system call. I haven't seen anything that ``knows'' that exit(2) or exit_group(2) will never return and adds a special case in the system call table for that case. The complications of exit_group(2) are roughly those of moving ptrace_event out of do_exit. They look doable and I am going to look at that next. This is not to say that this is the most maintainable way or that we necessarily want to implement things this way, but I need to look and see what it looks like. For purposes of discussion this is my current draft implementation.
diff --git a/include/linux/sched.h b/include/linux/sched.h index d2c881384517..891812d32b90 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1087,6 +1087,7 @@ struct task_struct { struct capture_control *capture_control; #endif /* Ptrace state: */ + int stop_code; unsigned long ptrace_message; kernel_siginfo_t *last_siginfo; diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h index b5ebf6c01292..33c50119b193 100644 --- a/include/linux/ptrace.h +++ b/include/linux/ptrace.h @@ -164,18 +164,29 @@ static inline void ptrace_event(int event, unsigned long message) } } +static inline bool ptrace_post_event(int event, unsigned long message) +{ + bool posted = false; + if (unlikely(ptrace_event_enabled(current, event))) { + current->ptrace_message = message; + current->stop_code = (event << 8) | SIGTRAP; + set_tsk_thread_flag(current, TIF_SIGPENDING); + posted = true; + } else if (event == PTRACE_EVENT_EXEC) { + /* legacy EXEC report via SIGTRAP */ + if ((current->ptrace & (PT_PTRACED|PT_SEIZED)) == PT_PTRACED) + send_sig(SIGTRAP, current, 0); + } + return posted; +} + /** - * ptrace_event_pid - possibly stop for a ptrace event notification - * @event: %PTRACE_EVENT_* value to report - * @pid: process identifier for %PTRACE_GETEVENTMSG to return - * - * Check whether @event is enabled and, if so, report @event and @pid - * to the ptrace parent. @pid is reported as the pid_t seen from the - * ptrace parent's pid namespace. + * pid_parent_nr - Return the number the parent knows this pid as + * @pid: The struct pid whose numerical value we want * * Called without locks. */ -static inline void ptrace_event_pid(int event, struct pid *pid) +static inline pid_t pid_parent_nr(struct pid *pid) { /* * FIXME: There's a potential race if a ptracer in a different pid @@ -183,16 +194,15 @@ static inline void ptrace_event_pid(int event, struct pid *pid) * when we acquire tasklist_lock in ptrace_stop(). 
If this happens, * the ptracer will get a bogus pid from PTRACE_GETEVENTMSG. */ - unsigned long message = 0; + pid_t nr = 0; struct pid_namespace *ns; rcu_read_lock(); ns = task_active_pid_ns(rcu_dereference(current->parent)); if (ns) - message = pid_nr_ns(pid, ns); + nr = pid_nr_ns(pid, ns); rcu_read_unlock(); - - ptrace_event(event, message); + return nr; } /** diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index e24b1fe348e3..a2eac3831369 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -97,6 +97,8 @@ extern void exit_mm_release(struct task_struct *, struct mm_struct *); /* Remove the current tasks stale references to the old mm_struct on exec() */ extern void exec_mm_release(struct task_struct *, struct mm_struct *); +extern int wait_for_vfork_done(struct task_struct *child, struct completion *vfork); + #ifdef CONFIG_MEMCG extern void mm_update_next_owner(struct mm_struct *mm); #else diff --git a/fs/exec.c b/fs/exec.c index 18594f11c31f..bb4751d84e2d 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1781,7 +1781,7 @@ static int exec_binprm(struct linux_binprm *bprm) audit_bprm(bprm); trace_sched_process_exec(current, old_pid, bprm); - ptrace_event(PTRACE_EVENT_EXEC, old_vpid); + ptrace_post_event(PTRACE_EVENT_EXEC, old_vpid); proc_exec_connector(current); return 0; } diff --git a/kernel/exit.c b/kernel/exit.c index fd1c04193e18..aeb22a8e4d24 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit); SYSCALL_DEFINE1(exit, int, error_code) { - do_exit((error_code&0xff)<<8); + long code = (error_code&0xff)<<8; + if (!ptrace_post_event(PTRACE_EVENT_EXIT, code)) + do_exit((error_code&0xff)<<8); } /* diff --git a/kernel/fork.c b/kernel/fork.c index dc06afd725cb..8533e056a3d6 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1266,8 +1266,7 @@ static void complete_vfork_done(struct task_struct *tsk) task_unlock(tsk); } -static int wait_for_vfork_done(struct task_struct *child, - struct 
completion *vfork) +int wait_for_vfork_done(struct task_struct *child, struct completion *vfork) { int killed; @@ -2278,7 +2277,8 @@ static __latent_entropy struct task_struct *copy_process( init_task_pid_links(p); if (likely(p->pid)) { - ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace); + ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || + (trace && ptrace_event_enabled(current, trace))); init_task_pid(p, PIDTYPE_PID, pid); if (thread_group_leader(p)) { @@ -2462,7 +2462,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) pid_t kernel_clone(struct kernel_clone_args *args) { u64 clone_flags = args->flags; - struct completion vfork; + unsigned long message; struct pid *pid; struct task_struct *p; int trace = 0; @@ -2495,9 +2495,6 @@ pid_t kernel_clone(struct kernel_clone_args *args) trace = PTRACE_EVENT_CLONE; else trace = PTRACE_EVENT_FORK; - - if (likely(!ptrace_event_enabled(current, trace))) - trace = 0; } p = copy_process(NULL, trace, NUMA_NO_NODE, args); @@ -2512,30 +2509,27 @@ pid_t kernel_clone(struct kernel_clone_args *args) */ trace_sched_process_fork(current, p); - pid = get_task_pid(p, PIDTYPE_PID); + pid = task_pid(p); nr = pid_vnr(pid); + message = pid_parent_nr(pid); if (clone_flags & CLONE_PARENT_SETTID) put_user(nr, args->parent_tid); - if (clone_flags & CLONE_VFORK) { - p->vfork_done = &vfork; + if (!(clone_flags & CLONE_VFORK)) { + wake_up_new_task(p); + ptrace_post_event(trace, message); + } + else if (!ptrace_post_event(PTRACE_EVENT_VFORK, (unsigned long)p)) { + struct completion vfork; init_completion(&vfork); + p->vfork_done = &vfork; get_task_struct(p); + wake_up_new_task(p); + if (wait_for_vfork_done(p, &vfork)) + ptrace_post_event(PTRACE_EVENT_VFORK_DONE, message); } - wake_up_new_task(p); - - /* forking complete and child started to run, tell ptracer */ - if (unlikely(trace)) - ptrace_event_pid(trace, pid); - - if (clone_flags & CLONE_VFORK) { - if (!wait_for_vfork_done(p, &vfork)) - 
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid); - } - - put_pid(pid); return nr; } diff --git a/kernel/signal.c b/kernel/signal.c index f7c6ffcbd044..8ac8c4a31d88 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -155,7 +155,8 @@ static bool recalc_sigpending_tsk(struct task_struct *t) if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) || PENDING(&t->pending, &t->blocked) || PENDING(&t->signal->shared_pending, &t->blocked) || - cgroup_task_frozen(t)) { + cgroup_task_frozen(t) || + t->stop_code) { set_tsk_thread_flag(t, TIF_SIGPENDING); return true; } @@ -2607,6 +2608,39 @@ bool get_signal(struct ksignal *ksig) if (unlikely(current->task_works)) task_work_run(); +ptrace_event: + /* Handle a posted ptrace event */ + if (unlikely(current->stop_code)) { + int stop_code = current->stop_code; + unsigned long message = current->ptrace_message; + struct completion vfork; + struct task_struct *p; + + current->stop_code = 0; + + if (stop_code == PTRACE_EVENT_VFORK) { + p = (struct task_struct *)message; + get_task_struct(p); + current->ptrace_message = pid_parent_nr(task_pid(p)); + init_completion(&vfork); + p->vfork_done = &vfork; + wake_up_new_task(p); + } + + spin_lock_irq(&sighand->siglock); + ptrace_do_notify(SIGTRAP, stop_code, CLD_TRAPPED); + spin_unlock_irq(&sighand->siglock); + + if ((stop_code == PTRACE_EVENT_VFORK) && + wait_for_vfork_done(p, &vfork) && + ptrace_post_event(PTRACE_EVENT_VFORK_DONE, message)) + goto ptrace_event; + + if (stop_code == PTRACE_EVENT_EXIT) { + do_exit(message); + } + } + /* * For non-generic architectures, check for TIF_NOTIFY_SIGNAL so * that the arch handlers don't all have to do it. If we get here Eric ^ permalink raw reply related [flat|nested] 119+ messages in thread
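Editorial aside, for readers following the draft above: ptrace_post_event() stores the event packed as `(event << 8) | SIGTRAP`, which is the same shape a tracer sees in `status >> 8` after waitpid(2). A minimal userspace sketch of that packing, assuming Linux signal numbering, is shown here. The `sketch_*` names and the two local event constants are hypothetical stand-ins (the values mirror PTRACE_EVENT_VFORK and PTRACE_EVENT_EXIT from the uapi header); none of this is kernel code.

```c
#include <assert.h>
#include <signal.h>

/* Assumed local copies of two uapi event numbers; not taken from the patch. */
enum { SKETCH_EVENT_VFORK = 2, SKETCH_EVENT_EXIT = 6 };

/* Pack an event the way the draft's ptrace_post_event() fills stop_code. */
static inline int sketch_encode_stop_code(int event)
{
	assert(event > 0 && event < 256);
	return (event << 8) | SIGTRAP;
}

/* Recover the event number from a packed stop code. */
static inline int sketch_stop_code_event(int stop_code)
{
	return stop_code >> 8;
}

/* Recover the signal number from a packed stop code. */
static inline int sketch_stop_code_signal(int stop_code)
{
	return stop_code & 0xff;
}
```

One thing the sketch makes visible: the packed value is never equal to the bare event number, so any comparison against an event constant would presumably want to go through the decode step first.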
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-22 20:52 ` Eric W. Biederman @ 2021-06-23 0:41 ` Linus Torvalds 2021-06-23 14:33 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-23 0:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Tue, Jun 22, 2021 at 1:53 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Playing with it some more I think I have everything working working > except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and > group_exit(2). > > Basically in exit sending yourself a signal and then calling do_exit > from the signal handler is not unreasonable, as exit is an ordinary > system call. Ok, this is a bit odd, but I do like the concept of just making ptrace_event just post a signal, and have all ptrace things always be handled at signal time (or the special system call entry/exit, which is fine too). > For purposes of discussion this is my current draft implementation. I didn't check what is so different about exit_group() that you left that as an exercise for the reader, but if that ends up then removing the whole "wait synchronously for ptrace" cases for good I don't _hate_ this. It's a bit odd, but it would be really nice to limit where ptrace picks up data. We do end up doing that stuff in "get_signal()", and that means that we have the interaction with io_uring calling it directly, but it's at least not a new thing. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-23 0:41 ` Linus Torvalds @ 2021-06-23 14:33 ` Eric W. Biederman 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-23 14:33 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, Jun 22, 2021 at 1:53 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> Playing with it some more I think I have everything working working >> except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and >> group_exit(2). >> >> Basically in exit sending yourself a signal and then calling do_exit >> from the signal handler is not unreasonable, as exit is an ordinary >> system call. > > Ok, this is a bit odd, but I do like the concept of just making > ptrace_event just post a signal, and have all ptrace things always be > handled at signal time (or the special system call entry/exit, which > is fine too). > >> For purposes of discussion this is my current draft implementation. > > I didn't check what is so different about exit_group() that you left > that as an exercise for the reader, but if that ends up then removing > the whole "wait synchromously for ptrace" cases for good I don't > _hate_ this. It's a bit odd, but it would be really nice to limit > where ptrace picks up data. I am still figuring out exit_group. I am hoping for sometime today. My intuition tells me I can do it, and I have a sense of what threads I need to pull to get there. I just don't know what the code is going to look like yet. Basically solving exit_group means moving ptrace_event out of do_exit. 
> We do end up doing that stuff in "get_signal()", and that means that > we have the interaction with io_uring calling it directly, but it's at > least not a new thing. The ugliest bit is having to repeat the wait_for_vfork_done both in fork and in get_signal. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 0/9] Refactoring exit 2021-06-23 14:33 ` Eric W. Biederman @ 2021-06-24 18:57 ` Eric W. Biederman 2021-06-24 18:59 ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman ` (9 more replies) 0 siblings, 10 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 18:57 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook I dug into exit because PTRACE_EVENT_EXIT not being guaranteed to be called with a stack where ptrace can read and write all of the userspace registers can lead to unfiltered reads and writes of kernel stack contents. While looking into it I realized that there are a lot of little races between all of the ways an exit can be initiated. I don't know of a way those races are harmful, but they make the code difficult to reason about. The solution this set of changes adopts is to implement good primitives for asynchronous exit and exit_group requests and to modify exit(2) and exit_group(2) to use those primitives. The result should be more consistent determination of the reason for an exit, as well as PTRACE_EVENT_EXIT always being called from a context (get_signal) where ptrace is guaranteed to be able to read and write all of the registers. I believe the set of changes could be justified for the cleanups alone even if PTRACE_EVENT_EXIT did not need to be moved. Which makes me feel good about this approach. If a way can be found that coredumps can be started from complete_signal (needed for timely handling of fatal signals) instead of needing to start in do_coredump for proper synchronization, force_siginfo_to_task and get_signal can be significantly simplified. As it is, a lot of checks are duplicated to ensure that everything works properly in the presence of do_coredump.
So far the code has been lightly tested, and the descriptions of some of the patches are a bit light, but I think this shows the direction I am aiming to travel for sorting out exit(2) and exit_group(2). Eric W. Biederman (9): signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) signal/seccomp: Refactor seccomp signal and coredump generation signal/seccomp: Dump core when there is only one live thread signal: Factor start_group_exit out of complete_signal signal/group_exit: Use start_group_exit in place of do_group_exit signal: Fold do_group_exit into get_signal fixing io_uring threads signal: Make individual tasks exiting a first class concept. signal/task_exit: Use start_task_exit in place of do_exit signal: Move PTRACE_EVENT_EXIT into get_signal arch/sh/kernel/cpu/fpu.c | 10 +-- fs/exec.c | 10 ++- include/linux/sched/jobctl.h | 2 + include/linux/sched/signal.h | 5 ++ include/linux/sched/task.h | 1 - kernel/exit.c | 41 ++--------- kernel/seccomp.c | 45 +++--------- kernel/signal.c | 166 ++++++++++++++++++++++++++++++------------- 8 files changed, 154 insertions(+), 126 deletions(-) ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman @ 2021-06-24 18:59 ` Eric W. Biederman 2021-06-24 18:59 ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman ` (8 subsequent siblings) 9 siblings, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 18:59 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Today the sh code allocates memory the first time a process uses the fpu. If that memory allocation fails, kill the affected task with force_sig(SIGKILL) rather than do_group_exit(SIGKILL). Calling do_group_exit from an exception handler can potentially lead to deadlocks, as do_group_exit is not designed to be called from interrupt context. Instead use force_sig(SIGKILL) to kill the userspace process. Sending signals in general and force_sig in particular have been tested from interrupt context so there should be no problems. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- arch/sh/kernel/cpu/fpu.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/sh/kernel/cpu/fpu.c b/arch/sh/kernel/cpu/fpu.c index ae354a2931e7..fd6db0ab1928 100644 --- a/arch/sh/kernel/cpu/fpu.c +++ b/arch/sh/kernel/cpu/fpu.c @@ -62,18 +62,20 @@ void fpu_state_restore(struct pt_regs *regs) } if (!tsk_used_math(tsk)) { - local_irq_enable(); + int ret; /* * does a slab alloc which can sleep */ - if (init_fpu(tsk)) { + local_irq_enable(); + ret = init_fpu(tsk); + local_irq_disable(); + if (ret) { /* * ran out of memory!
*/ - do_group_exit(SIGKILL); + force_sig(SIGKILL); return; } - local_irq_disable(); } grab_fpu(regs); -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman 2021-06-24 18:59 ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman @ 2021-06-24 18:59 ` Eric W. Biederman 2021-06-26 3:17 ` Kees Cook 2021-06-24 19:00 ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman ` (7 subsequent siblings) 9 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 18:59 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Factor out force_sig_seccomp from the seccomp signal generation and place it in kernel/signal.c. The function force_sig_seccomp takes a parameter force_coredump to indicate that the sigaction field should be reset to SIG_DFL so that a coredump will be generated when the signal is delivered. force_sig_seccomp is then used to replace both seccomp_send_sigsys and seccomp_init_siginfo. force_sig_info_to_task gains an extra parameter to force using the default signal action. With this change seccomp is no longer a special case and there is exactly one place do_coredump is called from. Signed-off-by: "Eric W.
Biederman" <ebiederm@xmission.com> --- include/linux/sched/signal.h | 1 + kernel/seccomp.c | 43 ++++++++---------------------------- kernel/signal.c | 30 +++++++++++++++++++++---- 3 files changed, 36 insertions(+), 38 deletions(-) diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 7f4278fa21fe..774be5d3ac3e 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -329,6 +329,7 @@ int force_sig_pkuerr(void __user *addr, u32 pkey); int force_sig_perf(void __user *addr, u32 type, u64 sig_data); int force_sig_ptrace_errno_trap(int errno, void __user *addr); +int force_sig_seccomp(int syscall, int reason, bool force_coredump); extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); extern void force_sigsegv(int sig); diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 6ecd3f3a52b5..3e06d4628d98 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -920,30 +920,6 @@ void get_seccomp_filter(struct task_struct *tsk) refcount_inc(&orig->users); } -static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason) -{ - clear_siginfo(info); - info->si_signo = SIGSYS; - info->si_code = SYS_SECCOMP; - info->si_call_addr = (void __user *)KSTK_EIP(current); - info->si_errno = reason; - info->si_arch = syscall_get_arch(current); - info->si_syscall = syscall; -} - -/** - * seccomp_send_sigsys - signals the task to allow in-process syscall emulation - * @syscall: syscall number to send to userland - * @reason: filter-supplied reason code to send to userland (via si_errno) - * - * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info. 
- */ -static void seccomp_send_sigsys(int syscall, int reason) -{ - struct kernel_siginfo info; - seccomp_init_siginfo(&info, syscall, reason); - force_sig_info(&info); -} #endif /* CONFIG_SECCOMP_FILTER */ /* For use with seccomp_actions_logged */ @@ -1195,7 +1171,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, /* Show the handler the original registers. */ syscall_rollback(current, current_pt_regs()); /* Let the filter pass back 16 bits of data. */ - seccomp_send_sigsys(this_syscall, data); + force_sig_seccomp(this_syscall, data, false); goto skip; case SECCOMP_RET_TRACE: @@ -1266,18 +1242,17 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, /* Dump core only if this is the last remaining thread. */ if (action != SECCOMP_RET_KILL_THREAD || get_nr_threads(current) == 1) { - kernel_siginfo_t info; - /* Show the original registers in the dump. */ syscall_rollback(current, current_pt_regs()); - /* Trigger a manual coredump since do_exit skips it. */ - seccomp_init_siginfo(&info, this_syscall, data); - do_coredump(&info); + /* Trigger a coredump with SIGSYS */ + force_sig_seccomp(this_syscall, data, true); + } else { + if (action == SECCOMP_RET_KILL_THREAD) + do_exit(SIGSYS); + else + do_group_exit(SIGSYS); } - if (action == SECCOMP_RET_KILL_THREAD) - do_exit(SIGSYS); - else - do_group_exit(SIGSYS); + return -1; } unreachable(); diff --git a/kernel/signal.c b/kernel/signal.c index f7c6ffcbd044..da37cc4515f2 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -54,6 +54,7 @@ #include <asm/unistd.h> #include <asm/siginfo.h> #include <asm/cacheflush.h> +#include <asm/syscall.h> /* for syscall_get_* */ /* * SLAB caches for signal bits. @@ -1349,7 +1350,7 @@ int do_send_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *p * that is why we also clear SIGNAL_UNKILLABLE. 
*/ static int -force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t) +force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t, bool sigdfl) { unsigned long int flags; int ret, blocked, ignored; @@ -1360,7 +1361,7 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t) action = &t->sighand->action[sig-1]; ignored = action->sa.sa_handler == SIG_IGN; blocked = sigismember(&t->blocked, sig); - if (blocked || ignored) { + if (blocked || ignored || sigdfl) { action->sa.sa_handler = SIG_DFL; if (blocked) { sigdelset(&t->blocked, sig); @@ -1381,7 +1382,7 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t) int force_sig_info(struct kernel_siginfo *info) { - return force_sig_info_to_task(info, current); + return force_sig_info_to_task(info, current, false); } /* @@ -1712,7 +1713,7 @@ int force_sig_fault_to_task(int sig, int code, void __user *addr info.si_flags = flags; info.si_isr = isr; #endif - return force_sig_info_to_task(&info, t); + return force_sig_info_to_task(&info, t, false); } int force_sig_fault(int sig, int code, void __user *addr @@ -1820,6 +1821,27 @@ int force_sig_perf(void __user *addr, u32 type, u64 sig_data) return force_sig_info(&info); } +/** + * force_sig_seccomp - signals the task to allow in-process syscall emulation + * @syscall: syscall number to send to userland + * @reason: filter-supplied reason code to send to userland (via si_errno) + * + * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info. 
+ */ +int force_sig_seccomp(int syscall, int reason, bool force_coredump) +{ + struct kernel_siginfo info; + + clear_siginfo(&info); + info.si_signo = SIGSYS; + info.si_code = SYS_SECCOMP; + info.si_call_addr = (void __user *)KSTK_EIP(current); + info.si_errno = reason; + info.si_arch = syscall_get_arch(current); + info.si_syscall = syscall; + return force_sig_info_to_task(&info, current, force_coredump); +} + /* For the crazy architectures that include trap information in * the errno field, instead of an actual errno value. */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
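Editorial aside: the effect of the new `sigdfl` argument on force_sig_info_to_task() can be modelled as a small pure function. The sketch below is a userspace model of the `blocked || ignored || sigdfl` test in the hunk above, not the kernel implementation; the `sketch_*` types and names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of a signal's disposition for the current task. */
enum sketch_disposition { SKETCH_DFL, SKETCH_IGN, SKETCH_HANDLER };

struct sketch_sigstate {
	enum sketch_disposition disp; /* models action->sa.sa_handler */
	bool blocked;                 /* models sigismember(&t->blocked, sig) */
};

/*
 * Model of the decision in the patched force_sig_info_to_task():
 * a blocked or ignored signal, or any signal when sigdfl is set,
 * is reset to the default action and unblocked before delivery.
 */
static inline struct sketch_sigstate
sketch_force(struct sketch_sigstate s, bool sigdfl)
{
	bool ignored = (s.disp == SKETCH_IGN);

	if (s.blocked || ignored || sigdfl) {
		s.disp = SKETCH_DFL;
		s.blocked = false;
	}
	return s;
}
```

With `sigdfl` false, an installed handler survives and can catch the forced signal; with `sigdfl` true it is replaced by the default action, which for SIGSYS is what leads to the coredump.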
* Re: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation 2021-06-24 18:59 ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman @ 2021-06-26 3:17 ` Kees Cook 2021-06-28 19:21 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Kees Cook @ 2021-06-26 3:17 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo On Thu, Jun 24, 2021 at 01:59:55PM -0500, Eric W. Biederman wrote: > > Factor out force_sig_seccomp from the seccomp signal generation and > place it in kernel/signal.c. The function force_sig_seccomp takes a > paramter force_coredump to indicate that the sigaction field should be > reset to SIGDFL so that a coredump will be generated when the signal > is delivered. Ah! This is the part I missed when I was originally trying to figure out the coredump stuff. It's the need for setting a default handler (i.e. doing a coredump)? > force_sig_seccomp is then used to replace both seccomp_send_sigsys > and seccomp_init_siginfo. > > force_sig_info_to_task gains an extra parameter to force using > the default signal action. > > With this change seccomp is no longer a special case and there > becomes exactly one place do_coredump is called from. Looks good to me. This may benefit from force_sig_seccomp() to be wrapped in an #ifdef CONFIG_SECCOMP. (This patch reminds me that the seccomp self tests don't check for core dumps...) -- Kees Cook ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation 2021-06-26 3:17 ` Kees Cook @ 2021-06-28 19:21 ` Eric W. Biederman 0 siblings, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-28 19:21 UTC (permalink / raw) To: Kees Cook Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo Kees Cook <keescook@chromium.org> writes: > On Thu, Jun 24, 2021 at 01:59:55PM -0500, Eric W. Biederman wrote: >> >> Factor out force_sig_seccomp from the seccomp signal generation and >> place it in kernel/signal.c. The function force_sig_seccomp takes a >> paramter force_coredump to indicate that the sigaction field should be >> reset to SIGDFL so that a coredump will be generated when the signal >> is delivered. > > Ah! This is the part I missed when I was originally trying to figure > out the coredump stuff. It's the need for setting a default handler > (i.e. doing a coredump)? Yes. If we don't force the handler to SIG_DFL someone might catch SIGSYS. >> force_sig_seccomp is then used to replace both seccomp_send_sigsys >> and seccomp_init_siginfo. >> >> force_sig_info_to_task gains an extra parameter to force using >> the default signal action. >> >> With this change seccomp is no longer a special case and there >> becomes exactly one place do_coredump is called from. > > Looks good to me. This may benefit from force_sig_seccomp() to be wrapped > in an #ifdef CONFIG_SECCOMP. At which point Linus will probably be grumpy with me for introducing #ifdefs. I suspect seccomp at this point is sufficiently common that it is probably more productive to figure out how to remove #ifdef CONFIG_SECCOMP. > (This patch reminds me that the seccomp self tests don't check for core > dumps...)
This patch is slightly wrong in that it kept the call to do_group_exit when it can never be reached. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman 2021-06-24 18:59 ` [PATCH 1/9] signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL) Eric W. Biederman 2021-06-24 18:59 ` [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation Eric W. Biederman @ 2021-06-24 19:00 ` Eric W. Biederman 2021-06-26 3:20 ` Kees Cook 2021-06-24 19:01 ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman ` (6 subsequent siblings) 9 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 19:00 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Replace get_nr_threads with atomic_read(&current->signal->live) as that is a more accurate number that is decremented sooner. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- kernel/seccomp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 3e06d4628d98..5301eca670a0 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1241,7 +1241,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, seccomp_log(this_syscall, SIGSYS, action, true); /* Dump core only if this is the last remaining thread. */ if (action != SECCOMP_RET_KILL_THREAD || - get_nr_threads(current) == 1) { + (atomic_read(&current->signal->live) == 1)) { /* Show the original registers in the dump. */ syscall_rollback(current, current_pt_regs()); /* Trigger a coredump with SIGSYS */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread 2021-06-24 19:00 ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman @ 2021-06-26 3:20 ` Kees Cook 0 siblings, 0 replies; 119+ messages in thread From: Kees Cook @ 2021-06-26 3:20 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo On Thu, Jun 24, 2021 at 02:00:22PM -0500, Eric W. Biederman wrote: > Replace get_nr_threads with atomic_read(&current->signal->live) as > that is a more accurate number that is decremented sooner. Okay, seems fine to me. :) -- Kees Cook ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 4/9] signal: Factor start_group_exit out of complete_signal 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman ` (2 preceding siblings ...) 2021-06-24 19:00 ` [PATCH 3/9] signal/seccomp: Dump core when there is only one live thread Eric W. Biederman @ 2021-06-24 19:01 ` Eric W. Biederman 2021-06-24 20:04 ` Linus Torvalds 2021-06-26 3:24 ` Kees Cook 2021-06-24 19:01 ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman ` (5 subsequent siblings) 9 siblings, 2 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 19:01 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- include/linux/sched/signal.h | 2 ++ kernel/signal.c | 52 +++++++++++++++++++++++++----------- 2 files changed, 39 insertions(+), 15 deletions(-) diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 774be5d3ac3e..c007e55cb119 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -428,6 +428,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume) signal_wake_up_state(t, resume ? __TASK_TRACED : 0); } +void start_group_exit(int exit_code); + void task_join_group_stop(struct task_struct *task); #ifdef TIF_RESTORE_SIGMASK diff --git a/kernel/signal.c b/kernel/signal.c index da37cc4515f2..c79c010ca5f3 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1027,6 +1027,42 @@ static inline bool wants_signal(int sig, struct task_struct *p) return task_curr(p) || !task_sigpending(p); } +static void start_group_exit_locked(struct signal_struct *signal, int exit_code) +{ + /* + * Start a group exit and wake everybody up. 
+ * This way we don't have other threads + * running and doing things after a slower + * thread has the fatal signal pending. + */ + struct task_struct *t; + + signal->flags = SIGNAL_GROUP_EXIT; + signal->group_exit_code = exit_code; + signal->group_stop_count = 0; + __for_each_thread(signal, t) { + task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); + + /* Don't bother with already dead threads */ + if (t->exit_state) + continue; + sigaddset(&t->pending.signal, SIGKILL); + signal_wake_up(t, 1); + } +} + +void start_group_exit(int exit_code) +{ + if (!fatal_signal_pending(current)) { + struct sighand_struct *const sighand = current->sighand; + + spin_lock_irq(&sighand->siglock); + if (!fatal_signal_pending(current)) + start_group_exit_locked(current->signal, exit_code); + spin_unlock_irq(&sighand->siglock); + } +} + static void complete_signal(int sig, struct task_struct *p, enum pid_type type) { struct signal_struct *signal = p->signal; @@ -1076,21 +1112,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) * This signal will be fatal to the whole group. */ if (!sig_kernel_coredump(sig)) { - /* - * Start a group exit and wake everybody up. - * This way we don't have other threads - * running and doing things after a slower - * thread has the fatal signal pending. - */ - signal->flags = SIGNAL_GROUP_EXIT; - signal->group_exit_code = sig; - signal->group_stop_count = 0; - t = p; - do { - task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - sigaddset(&t->pending.signal, SIGKILL); - signal_wake_up(t, 1); - } while_each_thread(p, t); + start_group_exit_locked(signal, sig); return; } } -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal 2021-06-24 19:01 ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman @ 2021-06-24 20:04 ` Linus Torvalds 2021-06-26 3:24 ` Kees Cook 1 sibling, 0 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-24 20:04 UTC (permalink / raw) To: Eric W. Biederman Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook I don't really mind the patch, but this patch doesn't actually do what it says it does. It factors out start_group_exit_locked() - which all looks good. But then it also creates that new start_group_exit() function and makes the declaration for it, and nothing actually uses it. Yet. I'd do that second part later when you actually introduce the use in the next patch (5/9). Hmm? Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal 2021-06-24 19:01 ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman 2021-06-24 20:04 ` Linus Torvalds @ 2021-06-26 3:24 ` Kees Cook 1 sibling, 0 replies; 119+ messages in thread From: Kees Cook @ 2021-06-26 3:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo On Thu, Jun 24, 2021 at 02:01:20PM -0500, Eric W. Biederman wrote: > +static void start_group_exit_locked(struct signal_struct *signal, int exit_code) > +{ > + /* > + * Start a group exit and wake everybody up. > + * This way we don't have other threads > + * running and doing things after a slower > + * thread has the fatal signal pending. > + */ > + struct task_struct *t; > + > + signal->flags = SIGNAL_GROUP_EXIT; > + signal->group_exit_code = exit_code; > + signal->group_stop_count = 0; > + __for_each_thread(signal, t) { > + task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); > + > + /* Don't bother with already dead threads */ > + if (t->exit_state) > + continue; > + sigaddset(&t->pending.signal, SIGKILL); > + signal_wake_up(t, 1); > + } This both extracts it and changes it. For ease-of-review, maybe split this patch into the move and then the logic changes? -- Kees Cook ^ permalink raw reply [flat|nested] 119+ messages in thread
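The check, lock, recheck shape of start_group_exit() in patch 4/9 can be tried out in userspace. What follows is only a minimal sketch under stated assumptions: `struct group_state`, `start_group_exit_model()` and `start_group_exit_locked_model()` are hypothetical stand-ins, a pthread mutex plays the role of `sighand->siglock`, and an atomic flag plays the role of `fatal_signal_pending(current)`. It does not wake or signal any threads, which the kernel code does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's per-group signal state. */
struct group_state {
    pthread_mutex_t lock;   /* stands in for sighand->siglock */
    atomic_bool fatal;      /* stands in for fatal_signal_pending(current) */
    int group_exit_code;
};

/* Caller must hold gs->lock, like start_group_exit_locked() in the patch. */
static void start_group_exit_locked_model(struct group_state *gs, int exit_code)
{
    gs->group_exit_code = exit_code;
    /* The kernel version also queues SIGKILL for, and wakes, each live thread. */
    atomic_store(&gs->fatal, true);
}

/* Returns true if this call started the exit, false if one was already pending. */
static bool start_group_exit_model(struct group_state *gs, int exit_code)
{
    bool started = false;

    assert(gs != NULL);
    if (atomic_load(&gs->fatal))        /* cheap check without the lock */
        return false;

    pthread_mutex_lock(&gs->lock);
    if (!atomic_load(&gs->fatal)) {     /* recheck now that the lock is held */
        start_group_exit_locked_model(gs, exit_code);
        started = true;
    }
    pthread_mutex_unlock(&gs->lock);
    return started;
}
```

The point of the second check is that another thread may have started the group exit between the unlocked test and taking the lock; in that case the first exit code is kept and the later caller changes nothing.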
* [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman ` (3 preceding siblings ...) 2021-06-24 19:01 ` [PATCH 4/9] signal: Factor start_group_exit out of complete_signal Eric W. Biederman @ 2021-06-24 19:01 ` Eric W. Biederman 2021-06-26 3:35 ` Kees Cook 2021-06-24 19:02 ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman ` (4 subsequent siblings) 9 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 19:01 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Make thread exiting uniform by causing all threads to pass through get_signal when they are exiting. This simplifies the analysis of synchronization during exit and guarantees that the full set of registers will be available for ptrace to examine for threads that stop at PTRACE_EVENT_EXIT. Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> --- kernel/exit.c | 4 ++-- kernel/seccomp.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index fd1c04193e18..921519d80b56 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -931,8 +931,8 @@ do_group_exit(int exit_code) */ SYSCALL_DEFINE1(exit_group, int, error_code) { - do_group_exit((error_code & 0xff) << 8); - /* NOTREACHED */ + start_group_exit((error_code & 0xff) << 8); + /* get_signal will call do_exit */ return 0; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 5301eca670a0..b1c06fd1b205 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1250,7 +1250,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, if (action == SECCOMP_RET_KILL_THREAD) do_exit(SIGSYS); else - do_group_exit(SIGSYS); + start_group_exit(SIGSYS); } return -1; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit 2021-06-24 19:01 ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman @ 2021-06-26 3:35 ` Kees Cook 0 siblings, 0 replies; 119+ messages in thread From: Kees Cook @ 2021-06-26 3:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo On Thu, Jun 24, 2021 at 02:01:40PM -0500, Eric W. Biederman wrote: > > Make thread exiting uniform by causing all threads to pass through > get_signal when they are exiting. This simplifies the analysis > of synchronization during exit and guarantees that the full set > of registers will be available for ptrace to examine for > threads that stop at PTRACE_EVENT_EXIT. Yeah, cool. I do like making the process lifetime more sensible here. It always threw me that do_exit*() just stopped execution. :) For future me, can you add a comment on start_group_exit() that mentions where final process death happens? > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > kernel/exit.c | 4 ++-- > kernel/seccomp.c | 2 +- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/kernel/exit.c b/kernel/exit.c > index fd1c04193e18..921519d80b56 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -931,8 +931,8 @@ do_group_exit(int exit_code) > */ > SYSCALL_DEFINE1(exit_group, int, error_code) > { > - do_group_exit((error_code & 0xff) << 8); > - /* NOTREACHED */ > + start_group_exit((error_code & 0xff) << 8); > + /* get_signal will call do_exit */ > return 0; "0" feels weird here, but I can't think of any better "fail closed" error code here. 
> } > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 5301eca670a0..b1c06fd1b205 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -1250,7 +1250,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, > if (action == SECCOMP_RET_KILL_THREAD) > do_exit(SIGSYS); > else > - do_group_exit(SIGSYS); > + start_group_exit(SIGSYS); This could use a similar comment to the syscall's comment, just so I don't panic when I read this code in like 3 years. ;) Otherwise, yeah, looks good. -Kees > } > return -1; > } > -- > 2.20.1 > -- Kees Cook ^ permalink raw reply [flat|nested] 119+ messages in thread
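The two call sites in patch 5/9 pass differently shaped values to start_group_exit(): the syscall passes `(error_code & 0xff) << 8`, while seccomp passes the bare signal number SIGSYS. A small sketch of how such a status word is laid out may help when reading the diff. The helper names here are hypothetical, and the layout assumed is the usual Linux wait-status convention (exit code in bits 8 to 15, terminating signal in the low 7 bits).

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Status word as built by the exit_group(2) path: low byte of the code, shifted up. */
static int exit_status_word(int error_code)
{
    return (error_code & 0xff) << 8;
}

/* A nonzero value in the low 7 bits means the task was killed by a signal. */
static bool status_is_signal(int status)
{
    return (status & 0x7f) != 0;
}

/* Exit code a parent would see for a normal exit. */
static int status_exit_code(int status)
{
    return (status >> 8) & 0xff;
}

/* Terminating signal number, or 0 for a normal exit. */
static int status_term_signal(int status)
{
    return status & 0x7f;
}
```

Under that assumption, `exit_group(3)` yields a word that decodes as a normal exit with code 3, and the seccomp path's `SIGSYS` decodes as death by signal.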
* [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman ` (4 preceding siblings ...) 2021-06-24 19:01 ` [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit Eric W. Biederman @ 2021-06-24 19:02 ` Eric W. Biederman 2021-06-26 3:42 ` Kees Cook 2021-06-24 19:02 ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman ` (3 subsequent siblings) 9 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 19:02 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Fold do_group_exit into get_signal as it is the last caller. Move the group_exit logic above the PF_IO_WORKER exit, ensuring that if a PF_IO_WORKER catches SIGKILL every thread in the thread group will exit, not just the PF_IO_WORKER. Now that the information is easily available, only set PF_SIGNALED when it was a signal that caused the exit. Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> --- include/linux/sched/task.h | 1 - kernel/exit.c | 31 ------------------------------- kernel/signal.c | 35 +++++++++++++++++++++++++---------- 3 files changed, 25 insertions(+), 42 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index ef02be869cf2..45525512e3d0 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk) { } #endif -extern void do_group_exit(int); extern void exit_files(struct task_struct *); extern void exit_itimers(struct signal_struct *); diff --git a/kernel/exit.c b/kernel/exit.c index 921519d80b56..635f434122b7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code) do_exit((error_code&0xff)<<8); } -/* - * Take down every thread in the group. This is called by fatal signals - * as well as by sys_exit_group (below). - */ -void -do_group_exit(int exit_code) -{ - struct signal_struct *sig = current->signal; - - BUG_ON(exit_code & 0x80); /* core dumps don't get here */ - - if (signal_group_exit(sig)) - exit_code = sig->group_exit_code; - else if (!thread_group_empty(current)) { - struct sighand_struct *const sighand = current->sighand; - - spin_lock_irq(&sighand->siglock); - if (signal_group_exit(sig)) - /* Another thread got here before we took the lock. */ - exit_code = sig->group_exit_code; - else { - sig->group_exit_code = exit_code; - sig->flags = SIGNAL_GROUP_EXIT; - zap_other_threads(current); - } - spin_unlock_irq(&sighand->siglock); - } - - do_exit(exit_code); - /* NOTREACHED */ -} /* * this kills every thread in the thread group. 
Note that any externally diff --git a/kernel/signal.c b/kernel/signal.c index c79c010ca5f3..95a076af600a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig) { struct sighand_struct *sighand = current->sighand; struct signal_struct *signal = current->signal; + int exit_code; int signr; if (unlikely(current->task_works)) @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig) /* * Anything else is fatal, maybe with a core dump. */ - current->flags |= PF_SIGNALED; - if (sig_kernel_coredump(signr)) { if (print_fatal_signals) print_fatal_signal(ksig->info.si_signo); @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig) /* * If it was able to dump core, this kills all * other threads in the group and synchronizes with - * their demise. If we lost the race with another - * thread getting here, it set group_exit_code - * first and our do_group_exit call below will use - * that value and ignore the one we pass it. + * their demise. If another thread makes it + * to do_coredump first, it will set group_exit_code + * which will be passed to do_exit. */ do_coredump(&ksig->info); } + /* + * Death signals, no core dump. + */ + exit_code = signr; + if (signal_group_exit(signal)) { + exit_code = signal->group_exit_code; + } else { + spin_lock_irq(&sighand->siglock); + if (signal_group_exit(signal)) { + /* Another thread got here before we took the lock. */ + exit_code = signal->group_exit_code; + } else { + start_group_exit_locked(signal, exit_code); + } + spin_unlock_irq(&sighand->siglock); + } + + if (exit_code & 0x7f) + current->flags |= PF_SIGNALED; + /* * PF_IO_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig) if (current->flags & PF_IO_WORKER) goto out; - /* - * Death signals, no core dump. 
- */ - do_group_exit(ksig->info.si_signo); + do_exit(exit_code); /* NOTREACHED */ } spin_unlock_irq(&sighand->siglock); -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads 2021-06-24 19:02 ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman @ 2021-06-26 3:42 ` Kees Cook 2021-06-28 19:25 ` Eric W. Biederman 0 siblings, 1 reply; 119+ messages in thread From: Kees Cook @ 2021-06-26 3:42 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo On Thu, Jun 24, 2021 at 02:02:16PM -0500, Eric W. Biederman wrote: > > Fold do_group_exit into get_signal as it is the last caller. > > Move the group_exit logic above the PF_IO_WORKER exit, ensuring > that if a PF_IO_WORKER catches SIGKILL every thread in > the thread group will exit, not just the PF_IO_WORKER. > > Now that the information is easily available, only set PF_SIGNALED > when it was a signal that caused the exit. > > Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> > --- > include/linux/sched/task.h | 1 - > kernel/exit.c | 31 ------------------------------- > kernel/signal.c | 35 +++++++++++++++++++++++++---------- > 3 files changed, 25 insertions(+), 42 deletions(-) > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > index ef02be869cf2..45525512e3d0 100644 > --- a/include/linux/sched/task.h > +++ b/include/linux/sched/task.h > @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk) > { > } > #endif > -extern void do_group_exit(int); > > extern void exit_files(struct task_struct *); > extern void exit_itimers(struct signal_struct *); > diff --git a/kernel/exit.c b/kernel/exit.c > index 921519d80b56..635f434122b7 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code) > do_exit((error_code&0xff)<<8); > } > > -/* > - * Take down every thread in the group. This is called by fatal signals > - * as well as by sys_exit_group (below). > - */ > -void > -do_group_exit(int exit_code) > -{ > - struct signal_struct *sig = current->signal; > - > - BUG_ON(exit_code & 0x80); /* core dumps don't get here */ > - > - if (signal_group_exit(sig)) > - exit_code = sig->group_exit_code; > - else if (!thread_group_empty(current)) { > - struct sighand_struct *const sighand = current->sighand; > - > - spin_lock_irq(&sighand->siglock); > - if (signal_group_exit(sig)) > - /* Another thread got here before we took the lock. */ > - exit_code = sig->group_exit_code; > - else { > - sig->group_exit_code = exit_code; > - sig->flags = SIGNAL_GROUP_EXIT; > - zap_other_threads(current); Oh, now I see it: the "new code" in start_group_exit() is an open-coded zap_other_threads()? That wasn't clear to me, but makes sense now. > - } > - spin_unlock_irq(&sighand->siglock); > - } > - > - do_exit(exit_code); > - /* NOTREACHED */ > -} > > /* > * this kills every thread in the thread group. 
Note that any externally > diff --git a/kernel/signal.c b/kernel/signal.c > index c79c010ca5f3..95a076af600a 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig) > { > struct sighand_struct *sighand = current->sighand; > struct signal_struct *signal = current->signal; > + int exit_code; > int signr; > > if (unlikely(current->task_works)) > @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig) > /* > * Anything else is fatal, maybe with a core dump. > */ > - current->flags |= PF_SIGNALED; > - > if (sig_kernel_coredump(signr)) { > if (print_fatal_signals) > print_fatal_signal(ksig->info.si_signo); > @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig) > /* > * If it was able to dump core, this kills all > * other threads in the group and synchronizes with > - * their demise. If we lost the race with another > - * thread getting here, it set group_exit_code > - * first and our do_group_exit call below will use > - * that value and ignore the one we pass it. > + * their demise. If another thread makes it > + * to do_coredump first, it will set group_exit_code > + * which will be passed to do_exit. > */ > do_coredump(&ksig->info); > } > > + /* > + * Death signals, no core dump. > + */ > + exit_code = signr; > + if (signal_group_exit(signal)) { > + exit_code = signal->group_exit_code; > + } else { > + spin_lock_irq(&sighand->siglock); > + if (signal_group_exit(signal)) { > + /* Another thread got here before we took the lock. */ > + exit_code = signal->group_exit_code; > + } else { > + start_group_exit_locked(signal, exit_code); And here's the "if we didn't already do start_group_exit(), do it here". And that state is entirely captured via the SIGNAL_GROUP_EXIT flag. Cool. > + } > + spin_unlock_irq(&sighand->siglock); > + } > + > + if (exit_code & 0x7f) > + current->flags |= PF_SIGNALED; > + > /* > * PF_IO_WORKER threads will catch and exit on fatal signals > * themselves. 
They have cleanup that must be performed, so > @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig) > if (current->flags & PF_IO_WORKER) > goto out; > > - /* > - * Death signals, no core dump. > - */ > - do_group_exit(ksig->info.si_signo); > + do_exit(exit_code); > /* NOTREACHED */ > } > spin_unlock_irq(&sighand->siglock); > -- > 2.20.1 > -- Kees Cook ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads 2021-06-26 3:42 ` Kees Cook @ 2021-06-28 19:25 ` Eric W. Biederman 0 siblings, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-28 19:25 UTC (permalink / raw) To: Kees Cook Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo Kees Cook <keescook@chromium.org> writes: > On Thu, Jun 24, 2021 at 02:02:16PM -0500, Eric W. Biederman wrote: >> >> Fold do_group_exit into get_signal as it is the last caller. >> >> Move the group_exit logic above the PF_IO_WORKER exit, ensuring >> that if a PF_IO_WORKER catches SIGKILL every thread in >> the thread group will exit, not just the PF_IO_WORKER. >> >> Now that the information is easily available, only set PF_SIGNALED >> when it was a signal that caused the exit. >> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> --- >> include/linux/sched/task.h | 1 - >> kernel/exit.c | 31 ------------------------------- >> kernel/signal.c | 35 +++++++++++++++++++++++++---------- >> 3 files changed, 25 insertions(+), 42 deletions(-) >> >> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h >> index ef02be869cf2..45525512e3d0 100644 >> --- a/include/linux/sched/task.h >> +++ b/include/linux/sched/task.h >> @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk) >> { >> } >> #endif >> -extern void do_group_exit(int); >> >> extern void exit_files(struct task_struct *); >> extern void exit_itimers(struct signal_struct *); >> diff --git a/kernel/exit.c b/kernel/exit.c >> index 921519d80b56..635f434122b7 100644 >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code) >> do_exit((error_code&0xff)<<8); >> } >> >> -/* >> - * Take down every thread in the group. 
This is called by fatal signals >> - * as well as by sys_exit_group (below). >> - */ >> -void >> -do_group_exit(int exit_code) >> -{ >> - struct signal_struct *sig = current->signal; >> - >> - BUG_ON(exit_code & 0x80); /* core dumps don't get here */ >> - >> - if (signal_group_exit(sig)) >> - exit_code = sig->group_exit_code; >> - else if (!thread_group_empty(current)) { >> - struct sighand_struct *const sighand = current->sighand; >> - >> - spin_lock_irq(&sighand->siglock); >> - if (signal_group_exit(sig)) >> - /* Another thread got here before we took the lock. */ >> - exit_code = sig->group_exit_code; >> - else { >> - sig->group_exit_code = exit_code; >> - sig->flags = SIGNAL_GROUP_EXIT; >> - zap_other_threads(current); > > Oh, now I see it: the "new code" in start_group_exit() is an open-coded > zap_other_threads()? That wasn't clear to me, but makes sense now. Pretty much. I think zap_other_threads has actually muddied the waters quite a bit by putting reuse in the wrong place. >> - } >> - spin_unlock_irq(&sighand->siglock); >> - } >> - >> - do_exit(exit_code); >> - /* NOTREACHED */ >> -} >> >> /* >> * this kills every thread in the thread group. Note that any externally >> diff --git a/kernel/signal.c b/kernel/signal.c >> index c79c010ca5f3..95a076af600a 100644 >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig) >> { >> struct sighand_struct *sighand = current->sighand; >> struct signal_struct *signal = current->signal; >> + int exit_code; >> int signr; >> >> if (unlikely(current->task_works)) >> @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig) >> /* >> * Anything else is fatal, maybe with a core dump. 
>> */ >> - current->flags |= PF_SIGNALED; >> - >> if (sig_kernel_coredump(signr)) { >> if (print_fatal_signals) >> print_fatal_signal(ksig->info.si_signo); >> @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig) >> /* >> * If it was able to dump core, this kills all >> * other threads in the group and synchronizes with >> - * their demise. If we lost the race with another >> - * thread getting here, it set group_exit_code >> - * first and our do_group_exit call below will use >> - * that value and ignore the one we pass it. >> + * their demise. If another thread makes it >> + * to do_coredump first, it will set group_exit_code >> + * which will be passed to do_exit. >> */ >> do_coredump(&ksig->info); >> } >> >> + /* >> + * Death signals, no core dump. >> + */ >> + exit_code = signr; >> + if (signal_group_exit(signal)) { >> + exit_code = signal->group_exit_code; >> + } else { >> + spin_lock_irq(&sighand->siglock); >> + if (signal_group_exit(signal)) { >> + /* Another thread got here before we took the lock. */ >> + exit_code = signal->group_exit_code; >> + } else { >> + start_group_exit_locked(signal, exit_code); > > And here's the "if we didn't already do start_group_exit(), do it here". > And that state is entirely captured via the SIGNAL_GROUP_EXIT flag. > Cool. Yes. At least when the dust clears. >> + } >> + spin_unlock_irq(&sighand->siglock); >> + } >> + >> + if (exit_code & 0x7f) >> + current->flags |= PF_SIGNALED; >> + >> /* >> * PF_IO_WORKER threads will catch and exit on fatal signals >> * themselves. They have cleanup that must be performed, so >> @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig) >> if (current->flags & PF_IO_WORKER) >> goto out; >> >> - /* >> - * Death signals, no core dump. >> - */ >> - do_group_exit(ksig->info.si_signo); >> + do_exit(exit_code); >> /* NOTREACHED */ >> } >> spin_unlock_irq(&sighand->siglock); >> -- >> 2.20.1 >> ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH 7/9] signal: Make individual tasks exiting a first class concept. 2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman ` (5 preceding siblings ...) 2021-06-24 19:02 ` [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads Eric W. Biederman @ 2021-06-24 19:02 ` Eric W. Biederman 2021-06-24 20:11 ` Linus Torvalds 2021-06-24 19:03 ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman ` (2 subsequent siblings) 9 siblings, 1 reply; 119+ messages in thread From: Eric W. Biederman @ 2021-06-24 19:02 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Implement start_task_exit_locked and rewrite the de_thread logic in exec using it. Calling start_task_exit_locked is equivalent to asynchronously calling exit(2) aka pthread_exit on a task. Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> --- fs/exec.c | 10 +++++++++- include/linux/sched/jobctl.h | 2 ++ include/linux/sched/signal.h | 1 + kernel/signal.c | 37 ++++++++++++++++-------------------- 4 files changed, 28 insertions(+), 22 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 18594f11c31f..b6f50213f0a0 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1040,6 +1040,7 @@ static int de_thread(struct task_struct *tsk) struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; spinlock_t *lock = &oldsighand->siglock; + struct task_struct *t; if (thread_group_empty(tsk)) goto no_thread_group; @@ -1058,7 +1059,14 @@ static int de_thread(struct task_struct *tsk) } sig->group_exit_task = tsk; - sig->notify_count = zap_other_threads(tsk); + sig->group_stop_count = 0; + sig->notify_count = 0; + __for_each_thread(sig, t) { + if (t == tsk) + continue; + sig->notify_count++; + start_task_exit_locked(t, SIGKILL); + } if (!thread_group_leader(tsk)) sig->notify_count--; diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h index fa067de9f1a9..e94833b0c819 100644 --- a/include/linux/sched/jobctl.h +++ b/include/linux/sched/jobctl.h @@ -19,6 +19,7 @@ struct task_struct; #define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */ #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ #define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */ +#define JOBCTL_TASK_EXITING_BIT 31 /* the task is exiting */ #define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) #define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) @@ -28,6 +29,7 @@ struct task_struct; #define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT) #define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) #define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT) +#define JOBCTL_TASK_EXITING (1UL << JOBCTL_TASK_EXITING_BIT) #define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | 
JOBCTL_TRAP_MASK) diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index c007e55cb119..a958381ba4a9 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -429,6 +429,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume) } void start_group_exit(int exit_code); +void start_task_exit_locked(struct task_struct *task, int exit_code); void task_join_group_stop(struct task_struct *task); diff --git a/kernel/signal.c b/kernel/signal.c index 95a076af600a..afbc001220dd 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -264,6 +264,12 @@ static inline void print_dropped_signal(int sig) current->comm, current->pid, sig); } +static void task_set_jobctl_exiting(struct task_struct *task, int exit_code) +{ + WARN_ON_ONCE(task->jobctl & ~JOBCTL_STOP_SIGMASK); + task->jobctl = JOBCTL_TASK_EXITING | (exit_code & JOBCTL_STOP_SIGMASK); +} + /** * task_set_jobctl_pending - set jobctl pending bits * @task: target task @@ -1407,28 +1413,15 @@ int force_sig_info(struct kernel_siginfo *info) return force_sig_info_to_task(info, current, false); } -/* - * Nuke all other threads in the group. 
- */ -int zap_other_threads(struct task_struct *p) +void start_task_exit_locked(struct task_struct *task, int exit_code) { - struct task_struct *t = p; - int count = 0; - - p->signal->group_stop_count = 0; - - while_each_thread(p, t) { - task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); - count++; - - /* Don't bother with already dead threads */ - if (t->exit_state) - continue; - sigaddset(&t->pending.signal, SIGKILL); - signal_wake_up(t, 1); + task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK); + /* Only bother with threads that might be alive */ + if (!task->exit_state) { + task_set_jobctl_exiting(task, exit_code); + sigaddset(&task->pending.signal, SIGKILL); + signal_wake_up(task, 1); } - - return count; } struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, @@ -2714,7 +2707,7 @@ bool get_signal(struct ksignal *ksig) } /* Has this task already been marked for death? */ - if (signal_group_exit(signal)) { + if (signal_group_exit(signal) || (current->jobctl & JOBCTL_TASK_EXITING)) { ksig->info.si_signo = signr = SIGKILL; sigdelset(&current->pending.signal, SIGKILL); trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO, @@ -2874,6 +2867,8 @@ bool get_signal(struct ksignal *ksig) if (signal_group_exit(signal)) { /* Another thread got here before we took the lock. */ exit_code = signal->group_exit_code; + } else if (current->jobctl & JOBCTL_TASK_EXITING) { + exit_code = current->jobctl & JOBCTL_STOP_SIGMASK; } else { start_group_exit_locked(signal, exit_code); } -- 2.20.1 ^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.
  2021-06-24 19:02 ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman
@ 2021-06-24 20:11 ` Linus Torvalds
  2021-06-24 21:37 ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Linus Torvalds @ 2021-06-24 20:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Thu, Jun 24, 2021 at 12:03 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Implement start_task_exit_locked and rewrite the de_thread logic
> in exec using it.
>
> Calling start_task_exit_locked is equivalent to asynchronously
> calling exit(2) aka pthread_exit on a task.

Ok, so this is the patch that makes me go "Yeah, this seems to all go together".

The whole "start_exit()" thing seemed fairly sane as an interesting
concept to the whole ptrace notification thing, but this one actually
made me think it makes conceptual sense and how we had exactly that
"start exit asynchronously" case already in zap_other_threads().

So doing that zap_other_threads() as that async exit makes me just
think that yes, this series is the right thing, because it not only
cleans up the ptrace condition, it makes sense in this entirely
unrelated area too.

So I think I'm convinced. I'd like Oleg in particular to Ack this
series, and Al to look it over, but I do think this is the right
direction.

Linus
* Re: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.
  2021-06-24 20:11 ` Linus Torvalds
@ 2021-06-24 21:37 ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 21:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 24, 2021 at 12:03 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Implement start_task_exit_locked and rewrite the de_thread logic
>> in exec using it.
>>
>> Calling start_task_exit_locked is equivalent to asynchronously
>> calling exit(2) aka pthread_exit on a task.
>
> Ok, so this is the patch that makes me go "Yeah, this seems to all go together".
>
> The whole "start_exit()" thing seemed fairly sane as an interesting
> concept to the whole ptrace notification thing, but this one actually
> made me think it makes conceptual sense and how we had exactly that
> "start exit asynchronously" case already in zap_other_threads().
>
> So doing that zap_other_threads() as that async exit makes me just
> think that yes, this series is the right thing, because it not only
> cleans up the ptrace condition, it makes sense in this entirely
> unrelated area too.
>
> So I think I'm convinced. I'd like Oleg in particular to Ack this
> series, and Al to look it over, but I do think this is the right
> direction.

Thanks.

It took a bit of exploration and playing with things to get here, but
I had the same sense.

Next round I will see if I can clean up the patches a bit more.

Eric
* [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit
  2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  ` (6 preceding siblings ...)
  2021-06-24 19:02 ` [PATCH 7/9] signal: Make individual tasks exiting a first class concept Eric W. Biederman
@ 2021-06-24 19:03 ` Eric W. Biederman
  2021-06-26  5:56 ` Kees Cook
  2021-06-24 19:03 ` [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal Eric W. Biederman
  2021-06-24 22:45 ` [PATCH 0/9] Refactoring exit Al Viro
  9 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

Reuse start_task_exit_locked to implement start_task_exit.

Simplify the exit logic by having all exits go through get_signal.
This simplifies the analysis of synchronization during exit and
guarantees a full set of registers will be available for ptrace to
examine at PTRACE_EVENT_EXIT.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  1 +
 kernel/exit.c                |  4 +++-
 kernel/seccomp.c             |  2 +-
 kernel/signal.c              | 12 ++++++++++++
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a958381ba4a9..3f4e69c019b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -430,6 +430,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 
 void start_group_exit(int exit_code);
 void start_task_exit_locked(struct task_struct *task, int exit_code);
+void start_task_exit(int exit_code);
 
 void task_join_group_stop(struct task_struct *task);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 635f434122b7..51e0c82b3f7d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);
 
 SYSCALL_DEFINE1(exit, int, error_code)
 {
-	do_exit((error_code&0xff)<<8);
+	start_task_exit((error_code&0xff)<<8);
+	/* get_signal will call do_exit */
+	return 0;
 }
 
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b1c06fd1b205..e0c4c123a8bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1248,7 +1248,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			force_sig_seccomp(this_syscall, data, true);
 		} else {
 			if (action == SECCOMP_RET_KILL_THREAD)
-				do_exit(SIGSYS);
+				start_task_exit(SIGSYS);
 			else
 				start_group_exit(SIGSYS);
 		}
diff --git a/kernel/signal.c b/kernel/signal.c
index afbc001220dd..63fda9b6bbf9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1424,6 +1424,18 @@ void start_task_exit_locked(struct task_struct *task, int exit_code)
 	}
 }
 
+void start_task_exit(int exit_code)
+{
+	struct task_struct *task = current;
+	if (!fatal_signal_pending(task)) {
+		struct sighand_struct *const sighand = task->sighand;
+		spin_lock_irq(&sighand->siglock);
+		if (!fatal_signal_pending(current))
+			start_task_exit_locked(task, exit_code);
+		spin_unlock_irq(&sighand->siglock);
+	}
+}
+
 struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 					   unsigned long *flags)
 {
-- 
2.20.1
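[Editorial illustration, not part of the posted patch.] start_task_exit() uses a check, lock, recheck shape: a cheap unlocked test of fatal_signal_pending(), then the same test again under siglock before committing. A minimal userspace sketch of that shape is below, assuming a toy spinlock built from C11 atomics. Every toy_* name is hypothetical, and the atomic flag only stands in for what fatal_signal_pending() reports in the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a task; none of these names are kernel APIs. */
struct toy_task {
	atomic_flag siglock;        /* plays the role of sighand->siglock */
	atomic_bool fatal_pending;  /* plays the role of fatal_signal_pending() */
	int exit_code;              /* protected by siglock */
};

static void toy_task_init(struct toy_task *t)
{
	atomic_flag_clear(&t->siglock);
	atomic_init(&t->fatal_pending, false);
	t->exit_code = 0;
}

static void toy_lock(struct toy_task *t)
{
	while (atomic_flag_test_and_set_explicit(&t->siglock,
						 memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void toy_unlock(struct toy_task *t)
{
	atomic_flag_clear_explicit(&t->siglock, memory_order_release);
}

/* Returns true if this call is the one that started the exit. */
static bool toy_start_task_exit(struct toy_task *t, int exit_code)
{
	bool started = false;

	/* Cheap unlocked check: skip the lock if an exit is already pending. */
	if (atomic_load(&t->fatal_pending))
		return false;

	toy_lock(t);
	/* Recheck under the lock: another thread may have won the race. */
	if (!atomic_load(&t->fatal_pending)) {
		t->exit_code = exit_code;
		atomic_store(&t->fatal_pending, true);
		started = true;
	}
	toy_unlock(t);
	return started;
}
```

In the sketch the first caller records its exit code and later callers leave it untouched, which is the property the recheck under the lock is there to provide.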
* Re: [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit
  2021-06-24 19:03 ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman
@ 2021-06-26  5:56 ` Kees Cook
  0 siblings, 0 replies; 119+ messages in thread
From: Kees Cook @ 2021-06-26 5:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo

On Thu, Jun 24, 2021 at 02:03:25PM -0500, Eric W. Biederman wrote:
> 
> Reuse start_task_exit_locked to implement start_task_exit.
> 
> Simplify the exit logic by having all exits go through get_signal.
> This simplifies the analysis of synchronization during exit and
> guarantees a full set of registers will be available for ptrace to
> examine at PTRACE_EVENT_EXIT.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched/signal.h |  1 +
>  kernel/exit.c                |  4 +++-
>  kernel/seccomp.c             |  2 +-
>  kernel/signal.c              | 12 ++++++++++++
>  4 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index a958381ba4a9..3f4e69c019b7 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -430,6 +430,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
>  
>  void start_group_exit(int exit_code);
>  void start_task_exit_locked(struct task_struct *task, int exit_code);
> +void start_task_exit(int exit_code);
>  
>  void task_join_group_stop(struct task_struct *task);
>  
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 635f434122b7..51e0c82b3f7d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);
>  
>  SYSCALL_DEFINE1(exit, int, error_code)
>  {
> -	do_exit((error_code&0xff)<<8);
> +	start_task_exit((error_code&0xff)<<8);
> +	/* get_signal will call do_exit */
> +	return 0;
>  }
>  
>  
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b1c06fd1b205..e0c4c123a8bf 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1248,7 +1248,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			force_sig_seccomp(this_syscall, data, true);
>  		} else {
>  			if (action == SECCOMP_RET_KILL_THREAD)
> -				do_exit(SIGSYS);
> +				start_task_exit(SIGSYS);
>  			else
>  				start_group_exit(SIGSYS);
>  		}

Looks good, yeah.

> diff --git a/kernel/signal.c b/kernel/signal.c
> index afbc001220dd..63fda9b6bbf9 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1424,6 +1424,18 @@ void start_task_exit_locked(struct task_struct *task, int exit_code)
>  	}
>  }
>  
> +void start_task_exit(int exit_code)
> +{
> +	struct task_struct *task = current;
> +	if (!fatal_signal_pending(task)) {
> +		struct sighand_struct *const sighand = task->sighand;
> +		spin_lock_irq(&sighand->siglock);
> +		if (!fatal_signal_pending(current))

efficiency nit: "task" instead of "current" here, yes?

> +			start_task_exit_locked(task, exit_code);
> +		spin_unlock_irq(&sighand->siglock);
> +	}
> +}
> +
>  struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
>  					   unsigned long *flags)
>  {
> -- 
> 2.20.1
> 

-- 
Kees Cook
* [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal
  2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  ` (7 preceding siblings ...)
  2021-06-24 19:03 ` [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit Eric W. Biederman
@ 2021-06-24 19:03 ` Eric W. Biederman
  2021-06-24 22:45 ` [PATCH 0/9] Refactoring exit Al Viro
  9 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-24 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

This ensures that we always have a full set of registers available
when PTRACE_EVENT_EXIT is reported, something that is not guaranteed
for callers of do_exit.

Additionally this guarantees PTRACE_EVENT_EXIT will not cause havoc
with abnormal exits.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/exit.c   | 2 --
 kernel/signal.c | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 51e0c82b3f7d..309f1d71e340 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -763,8 +763,6 @@ void __noreturn do_exit(long code)
 	profile_task_exit(tsk);
 	kcov_task_exit(tsk);
 
-	ptrace_event(PTRACE_EVENT_EXIT, code);
-
 	validate_creds_for_do_exit(tsk);
 
 	/*
diff --git a/kernel/signal.c b/kernel/signal.c
index 63fda9b6bbf9..7214331836bc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2890,6 +2890,8 @@ bool get_signal(struct ksignal *ksig)
 		if (exit_code & 0x7f)
 			current->flags |= PF_SIGNALED;
 
+		ptrace_event(PTRACE_EVENT_EXIT, exit_code);
+
 		/*
 		 * PF_IO_WORKER threads will catch and exit on fatal signals
 		 * themselves. They have cleanup that must be performed, so
-- 
2.20.1
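[Editorial illustration, not part of the posted patch.] Patches 8/9 and 9/9 rely on the usual wait(2)-style status layout: sys_exit builds (error_code & 0xff) << 8, seccomp passes a bare signal number, and get_signal() treats any of the low 7 bits as "killed by a signal" before setting PF_SIGNALED. A small userspace sketch of that encoding follows. The toy_* helper names are hypothetical, and the decode helper uses the standard macros from <sys/wait.h>.

```c
#include <stdbool.h>
#include <sys/wait.h>

/* What sys_exit hands to start_task_exit(): the low 8 bits of the
 * argument, moved into the exit-status byte of the status word. */
static int toy_status_from_exit(int error_code)
{
	return (error_code & 0xff) << 8;
}

/* The test get_signal() applies before setting PF_SIGNALED: a nonzero
 * value in the low 7 bits means the word carries a signal number. */
static bool toy_status_is_signal(int exit_code)
{
	return (exit_code & 0x7f) != 0;
}

/* How a parent would decode the same word with the wait(2) macros;
 * returns -1 when the status does not describe a normal exit. */
static int toy_decoded_exit_status(int status)
{
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Under this layout exit(42) yields 0x2a00, which is not flagged as a signal, while a bare signal number such as 9 is flagged and does not decode as a normal exit.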
* Re: [PATCH 0/9] Refactoring exit
  2021-06-24 18:57 ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  ` (8 preceding siblings ...)
  2021-06-24 19:03 ` [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal Eric W. Biederman
@ 2021-06-24 22:45 ` Al Viro
  2021-06-27 22:13 ` Al Viro
  2021-06-28 19:02 ` [PATCH 0/9] Refactoring exit Eric W. Biederman
  9 siblings, 2 replies; 119+ messages in thread
From: Al Viro @ 2021-06-24 22:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook

On Thu, Jun 24, 2021 at 01:57:35PM -0500, Eric W. Biederman wrote:

> So far the code has been lightly tested, and the descriptions of some
> of the patches are a bit light, but I think this shows the direction
> I am aiming to travel for sorting out exit(2) and exit_group(2).

FWIW, here's the current picture for do_exit(), aside of exit(2) and
do_exit_group():

1) stuff that is clearly oops-like -
	alpha:die_if_kernel()
	alpha:do_entUna()
	alpha:do_page_fault()
	arm:oops_end()
	arm:__do_kernel_fault()
	arm64:die()
	arm64:die_kernel_fault()
	csky:alignment()
	csky:die()
	csky:no_context()
	h8300:die()
	h8300:do_page_fault()
	hexagon:die()
	ia64:die()
	ia64:ia64_do_page_fault()
	m68k:die_if_kernel()
	m68k:send_fault_sig()
	microblaze:die()
	mips:die()
	nds32:handle_fpu_exception()
	nds32:die()
	nds32:unhandled_interruption()
	nds32:unhandled_exceptions()
	nds32:do_revinsn()
	nds32:do_page_fault()
	nios:die()
	openrisc:die()
	openrisc:do_page_fault()
	parisc:die_if_kernel()
	ppc:oops_end()
	riscv:die()
	riscv:die_kernel_fault()
	s390:die()
	s390:do_no_context()
	s390:do_low_address()
	sh:die()
	sparc32:die_if_kernel()
	sparc32:do_sparc_fault()
	sparc64:die_if_kernel()
	x86:rewind_stack_do_exit()
	xtensa:die()
	xtensa:bad_page_fault()
We really do not want ptrace anywhere near any of those and we do not want any
of that to return; this shit would better be handled right then and there - no
"post a fatal signal" would do.

2) sparc32 playing silly buggers with SIGILL in case when signal delivery
can't get a valid sigframe. The regular variant for that kind of stuff
is forced SIGSEGV from failure case of signal_setup_done(). We could force
that SIGILL instead of do_exit() there (and report failure from sigframe
setup), but I suspect that we'll get SIGSEGV override that SIGILL, with
user-visible behaviour change. Triggered by altstack overflow on sparc32;
sparc64 gets SIGSEGV in the same situation, just like everybody else.

3) ppc swapcontext(2). Normal syscall, on failure results in exit(SIGSEGV).
Not sure if we want to post signal here - exposing the caller to results of
failure might be... interesting. And I really don't know if we want to allow
ptrace() to poke around in the results of such failure. That's a question
for ppc maintainers.

4) sparc32:try_to_clear_window_buffer(). Probably could force SIGSEGV
instead of do_exit() there, but that might need a bit of massage in asm
glue - it's called on the way out of kernel, right before handling signals.
I'd like comments from davem on that one, though.

5) in xtensa fast_syscall_spill_registers() stuff. Might or might not be
similar to the above.

6) sparc64 in tsb_grow() - looks like "impossible case, kill the sucker
dead if that ever happens". Not sure if it's reachable at all.

7) s390 copy_thread() is doing something interesting in kernel thread case -
	frame->childregs.gprs[11] = (unsigned long)do_exit;
AFAICS, had been unused since 30dcb0996e40, when s390 switched to new
kernel_execve() semantics and kernel_thread_starter stopped using r11 (or
proceeding to do_exit() in the first place). Ought to be removed, if s390
folks ACK that.

8) x86:emulate_vsyscall(), x86:save_v86_state(), m68k:fpsp040_die(),
mips:bad_stack(), s390:__s390_handle_mcck(), ia64:mca_handler_bh(),
s390:default_trap_handler() - fuck knows.

9) seccomp stuff - this one should *NOT* be switched to posting signals;
it's on syscall_trace_enter() paths and we'd better have signal-equivalent
environment there. We sure as hell do have regular "stop and let tracer
poke around" in the same area - that's where strace is poking around.

10) there's a (moderate) bunch of places all over the tree where we have
kthread() payload hit do_exit(), with or without complete() or module_put().
No ptrace stuff is going to be hit there and I see no point in switching
those to posting anything. In particular, module_put_and_exit() sure as
hell does *NOT* want to return to caller - it might've been unmapped by
the point we are done. This do_exit() should really be noreturn.

11) abuses in kernel/kthread.c; AFAICS, it's misused as a mechanism to return
an error value to parent. No ptrace possible (parent definitely not traced)
and I don't see any point in delaying the handling of that do_exit() either
(same as with the execve failure in call_usermodehelper_exec_async()).

12) io-uring threads hitting do_exit(). These, apparently, can be ptraced...

13) there's bdflush(1, whatever), which is equivalent to exit(0).
IMO it's long past the time to simply remove the sucker.

14) reboot(2) stuff. No idea.

15) syscall_user_dispatch(). Didn't have time to look through that stuff
in details yet, so no idea at the moment.
* Re: [PATCH 0/9] Refactoring exit 2021-06-24 22:45 ` [PATCH 0/9] Refactoring exit Al Viro @ 2021-06-27 22:13 ` Al Viro 2021-06-27 22:59 ` Michael Schmitz 2021-06-28 19:02 ` [PATCH 0/9] Refactoring exit Eric W. Biederman 1 sibling, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-27 22:13 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote: > 13) there's bdflush(1, whatever), which is equivalent to exit(0). > IMO it's long past the time to simply remove the sucker. Incidentally, calling that from ptraced process on alpha leads to the same headache for tracer. _If_ we leave it around, this is another candidate for "hit yourself with that special signal" - both alpha and m68k have that syscall, and IMO adding an asm wrapper for that one is over the top. Said that, we really ought to bury that thing: commit 2f268ee88abb33968501a44368db55c63adaad40 Author: Andrew Morton <akpm@digeo.com> Date: Sat Dec 14 03:16:29 2002 -0800 [PATCH] deprecate use of bdflush() Patch from Robert Love <rml@tech9.net> We can never get rid of it if we do not deprecate it - so do so and print a stern warning to those who still run bdflush daemons. Deprecated for 18.5 years by now - I seriously suspect that we have some contributors younger than that... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit 2021-06-27 22:13 ` Al Viro @ 2021-06-27 22:59 ` Michael Schmitz 2021-06-28 7:31 ` Geert Uytterhoeven 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-27 22:59 UTC (permalink / raw) To: Al Viro, Eric W. Biederman Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook On 28/06/21 10:13 am, Al Viro wrote: > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote: > >> 13) there's bdflush(1, whatever), which is equivalent to exit(0). >> IMO it's long past the time to simply remove the sucker. > Incidentally, calling that from ptraced process on alpha leads to > the same headache for tracer. _If_ we leave it around, this is > another candidate for "hit yourself with that special signal" - > both alpha and m68k have that syscall, and IMO adding an asm > wrapper for that one is over the top. > > Said that, we really ought to bury that thing: > > commit 2f268ee88abb33968501a44368db55c63adaad40 > Author: Andrew Morton <akpm@digeo.com> > Date: Sat Dec 14 03:16:29 2002 -0800 > > [PATCH] deprecate use of bdflush() > > Patch from Robert Love <rml@tech9.net> > > We can never get rid of it if we do not deprecate it - so do so and > print a stern warning to those who still run bdflush daemons. > > Deprecated for 18.5 years by now - I seriously suspect that we have > some contributors younger than that... Haven't found that warning in over 7 years' worth of console logs, and I'm a good candidate for running the oldest userland in existence for m68k. Time to let it go. Cheers, Michael ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit 2021-06-27 22:59 ` Michael Schmitz @ 2021-06-28 7:31 ` Geert Uytterhoeven 2021-06-28 16:20 ` Eric W. Biederman 2021-06-28 17:14 ` Michael Schmitz 0 siblings, 2 replies; 119+ messages in thread From: Geert Uytterhoeven @ 2021-06-28 7:31 UTC (permalink / raw) To: Michael Schmitz Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook Hi Michael, On Mon, Jun 28, 2021 at 1:00 AM Michael Schmitz <schmitzmic@gmail.com> wrote: > On 28/06/21 10:13 am, Al Viro wrote: > > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote: > > > >> 13) there's bdflush(1, whatever), which is equivalent to exit(0). > >> IMO it's long past the time to simply remove the sucker. > > Incidentally, calling that from ptraced process on alpha leads to > > the same headache for tracer. _If_ we leave it around, this is > > another candidate for "hit yourself with that special signal" - > > both alpha and m68k have that syscall, and IMO adding an asm > > wrapper for that one is over the top. > > > > Said that, we really ought to bury that thing: > > > > commit 2f268ee88abb33968501a44368db55c63adaad40 > > Author: Andrew Morton <akpm@digeo.com> > > Date: Sat Dec 14 03:16:29 2002 -0800 > > > > [PATCH] deprecate use of bdflush() > > > > Patch from Robert Love <rml@tech9.net> > > > > We can never get rid of it if we do not deprecate it - so do so and > > print a stern warning to those who still run bdflush daemons. > > > > Deprecated for 18.5 years by now - I seriously suspect that we have > > some contributors younger than that... > > Haven't found that warning in over 7 years' worth of console logs, and > I'm a good candidate for running the oldest userland in existence for m68k. > > Time to let it go. 
The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz, which is a very old ramdisk from right after the m68k a.out to ELF transition: warning: process `update' used the obsolete bdflush system call Fix your initscripts? I still boot it, once in a while. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit
  2021-06-28  7:31 ` Geert Uytterhoeven
@ 2021-06-28 16:20 ` Eric W. Biederman
  2021-06-28 17:14 ` Michael Schmitz
  1 sibling, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-28 16:20 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Michael Schmitz, Al Viro, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook

Geert Uytterhoeven <geert@linux-m68k.org> writes:

> Hi Michael,
>
> On Mon, Jun 28, 2021 at 1:00 AM Michael Schmitz <schmitzmic@gmail.com> wrote:
>> On 28/06/21 10:13 am, Al Viro wrote:
>> > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
>> >
>> >> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
>> >> IMO it's long past the time to simply remove the sucker.
>> > Incidentally, calling that from ptraced process on alpha leads to
>> > the same headache for tracer. _If_ we leave it around, this is
>> > another candidate for "hit yourself with that special signal" -
>> > both alpha and m68k have that syscall, and IMO adding an asm
>> > wrapper for that one is over the top.
>> >
>> > Said that, we really ought to bury that thing:
>> >
>> > commit 2f268ee88abb33968501a44368db55c63adaad40
>> > Author: Andrew Morton <akpm@digeo.com>
>> > Date:   Sat Dec 14 03:16:29 2002 -0800
>> >
>> >     [PATCH] deprecate use of bdflush()
>> >
>> >     Patch from Robert Love <rml@tech9.net>
>> >
>> >     We can never get rid of it if we do not deprecate it - so do so and
>> >     print a stern warning to those who still run bdflush daemons.
>> >
>> > Deprecated for 18.5 years by now - I seriously suspect that we have
>> > some contributors younger than that...
>>
>> Haven't found that warning in over 7 years' worth of console logs, and
>> I'm a good candidate for running the oldest userland in existence for m68k.
>>
>> Time to let it go.
>
> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> which is a very old ramdisk from right after the m68k a.out to ELF
> transition:
>
>     warning: process `update' used the obsolete bdflush system call
>     Fix your initscripts?
>
> I still boot it, once in a while.

The only thing left in bdflush is that func == 1 calls do_exit(0),
which is a hack introduced in 2.3.23 aka October of 1999 to force the
userspace process calling bdflush to exit, instead of repeatedly calling
sys_bdflush.

Can you try deleting that func == 1 call and seeing if your old ramdisk
works?

I suspect userspace used to get into a tight spin calling bdflush with
func == 1, when that function stopped doing anything. That was back in
1999 so we are probably safe without it.

Eric
* Re: [PATCH 0/9] Refactoring exit 2021-06-28 7:31 ` Geert Uytterhoeven 2021-06-28 16:20 ` Eric W. Biederman @ 2021-06-28 17:14 ` Michael Schmitz 2021-06-28 19:17 ` Geert Uytterhoeven 1 sibling, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-28 17:14 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook Hi Geert, Am 28.06.2021 um 19:31 schrieb Geert Uytterhoeven: >> Haven't found that warning in over 7 years' worth of console logs, and >> I'm a good candidate for running the oldest userland in existence for m68k. >> >> Time to let it go. > > The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz, > which is a very old ramdisk from right after the m68k a.out to ELF > transition: > > warning: process `update' used the obsolete bdflush system call > Fix your initscripts? > > I still boot it, once in a while. OK; you take the cake. That ramdisk came to mind when I thought about where I'd last seen bdflush, but I've not used it in ages (not sure 14 MB are enough for that). The question then is - will bdflush fail gracefully, or spin retrying the syscall? Cheers, Michael > > Gr{oetje,eeting}s, > > Geert > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit 2021-06-28 17:14 ` Michael Schmitz @ 2021-06-28 19:17 ` Geert Uytterhoeven 2021-06-28 20:13 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Geert Uytterhoeven @ 2021-06-28 19:17 UTC (permalink / raw) To: Michael Schmitz Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook Hi Michael, On Mon, Jun 28, 2021 at 7:14 PM Michael Schmitz <schmitzmic@gmail.com> wrote: > Am 28.06.2021 um 19:31 schrieb Geert Uytterhoeven: > >> Haven't found that warning in over 7 years' worth of console logs, and > >> I'm a good candidate for running the oldest userland in existence for m68k. > >> > >> Time to let it go. > > > > The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz, > > which is a very old ramdisk from right after the m68k a.out to ELF > > transition: > > > > warning: process `update' used the obsolete bdflush system call > > Fix your initscripts? > > > > I still boot it, once in a while. > > OK; you take the cake. That ramdisk came to mind when I thought about > where I'd last seen bdflush, but I've not used it in ages (not sure 14 > MB are enough for that). Of course it will work on your 14 MiB machine! It fits on a floppy, _after_ decompression. It was used by people to install Linux on the hard disks of their beefy m68k machines, after they had set up the family Christmas tree, in December 1996. I also have a slightly larger one, built from OpenWRT when I did my first experiments on that. Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open a shell on the serial console, so it is more useful to me. > The question then is - will bdflush fail gracefully, or spin retrying > the syscall? Will add to my todo list... BTW, you can boot this ramdisk on ARAnyM, too. 
Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit 2021-06-28 19:17 ` Geert Uytterhoeven @ 2021-06-28 20:13 ` Michael Schmitz 2021-06-28 21:18 ` Geert Uytterhoeven 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-28 20:13 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Tejun Heo, Kees Cook Hi Geert, On 29/06/21 7:17 am, Geert Uytterhoeven wrote: >>> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz, >>> which is a very old ramdisk from right after the m68k a.out to ELF >>> transition: >>> >>> warning: process `update' used the obsolete bdflush system call >>> Fix your initscripts? >>> >>> I still boot it, once in a while. >> OK; you take the cake. That ramdisk came to mind when I thought about >> where I'd last seen bdflush, but I've not used it in ages (not sure 14 >> MB are enough for that). > Of course it will work on your 14 MiB machine! It fits on a floppy, _after_ > decompression. It was used by people to install Linux on the hard disks > of their beefy m68k machines, after they had set up the family Christmas > tree, in December 1996. Been there, done that. Wrote the HOWTO for ext2 filesystem byte-swapping. > I also have a slightly larger one, built from OpenWRT when I did my first > experiments on that. Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open > a shell on the serial console, so it is more useful to me. > >> The question then is - will bdflush fail gracefully, or spin retrying >> the syscall? > Will add to my todo list... > BTW, you can boot this ramdisk on ARAnyM, too. True. I can't find that ramdisk image anywhere - if you can point me to some archive, I'll give that a try. Cheers, Michael > > Gr{oetje,eeting}s, > > Geert > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit
  2021-06-28 20:13 ` Michael Schmitz
@ 2021-06-28 21:18 ` Geert Uytterhoeven
  2021-06-28 23:42 ` Michael Schmitz
  0 siblings, 1 reply; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-28 21:18 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe,
      Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
      Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
      Tejun Heo, Kees Cook

Hi Michael,

On Mon, Jun 28, 2021 at 10:13 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
> On 29/06/21 7:17 am, Geert Uytterhoeven wrote:
> >>> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> >>> which is a very old ramdisk from right after the m68k a.out to ELF
> >>> transition:
> >>>
> >>> warning: process `update' used the obsolete bdflush system call
> >>> Fix your initscripts?
> >>>
> >>> I still boot it, once in a while.
> >> OK; you take the cake. That ramdisk came to mind when I thought about
> >> where I'd last seen bdflush, but I've not used it in ages (not sure 14
> >> MB are enough for that).
> > Of course it will work on your 14 MiB machine! It fits on a floppy, _after_
> > decompression. It was used by people to install Linux on the hard disks
> > of their beefy m68k machines, after they had set up the family Christmas
> > tree, in December 1996.
>
> Been there, done that. Wrote the HOWTO for ext2 filesystem byte-swapping.

I knew I could revive your memory ;-)

> > I also have a slightly larger one, built from OpenWRT when I did my first
> > experiments on that. Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
> > a shell on the serial console, so it is more useful to me.
> >
> >> The question then is - will bdflush fail gracefully, or spin retrying
> >> the syscall?
> > Will add to my todo list...
> > BTW, you can boot this ramdisk on ARAnyM, too.
>
> True. I can't find that ramdisk image anywhere - if you can point me to
> some archive, I'll give that a try.

http://ftp.mac.linux-m68k.org/pub/linux-mac68k/initrd/

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
    -- Linus Torvalds

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit
  2021-06-28 21:18 ` Geert Uytterhoeven
@ 2021-06-28 23:42 ` Michael Schmitz
  2021-06-29 20:28 ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Michael Schmitz @ 2021-06-28 23:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Al Viro, Eric W. Biederman, Linus Torvalds, linux-arch, Jens Axboe,
      Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
      Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
      Tejun Heo, Kees Cook

Hi Geert,

On 29/06/21 9:18 am, Geert Uytterhoeven wrote:
>
>>>> The question then is - will bdflush fail gracefully, or spin retrying
>>>> the syscall?
>>> Will add to my todo list...
>>> BTW, you can boot this ramdisk on ARAnyM, too.
>> True. I can't find that ramdisk image anywhere - if you can point me to
>> some archive, I'll give that a try.
> http://ftp.mac.linux-m68k.org/pub/linux-mac68k/initrd/

Thanks - removing the

    if (func==1)
        do_exit(0);

part does give similar behaviour as before - kernel warns five times,
then shuts up (without change, warns twice only, and /sbin/update no
longer runs).

Removing the syscall from the m68k syscall table altogether still gives
a working ramdisk. /sbin/update is still running, so evidently doesn't
care about the invalid syscall result ...

Cheers,

Michael

>
> Gr{oetje,eeting}s,
>
> Geert
>

^ permalink raw reply [flat|nested] 119+ messages in thread
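The two experiments Michael reports above can be replayed against a small
user-space model of the handler. This is only a sketch: it mirrors the
msg_count cap and the func == 1 exit of the sys_bdflush() under discussion,
leaves out the CAP_SYS_ADMIN check, and the names bdflush_model,
model_bdflush() and count_warnings() are made up for the illustration. The
func values a real /sbin/update passes are not stated in the thread, so any
sequence fed to the model is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; the kernel keeps msg_count as a function-local static. */
struct bdflush_model {
	int msg_count;	/* warnings issued so far, capped at 5 */
};

/*
 * Model one call.  Returns true if the caller would have been terminated
 * by do_exit(0), false if the call returns to the caller.
 * exit_on_func1 == false models the experiment with the do_exit() removed.
 * The capability check is left out: the caller is assumed to be privileged.
 */
static bool model_bdflush(struct bdflush_model *m, int func, bool exit_on_func1)
{
	if (m->msg_count < 5)
		m->msg_count++;	/* the kernel prints its two-line warning here */

	return exit_on_func1 && func == 1;
}

/*
 * Replay a sequence of func arguments and report how many warnings were
 * issued before the caller exited or ran out of calls.
 */
static int count_warnings(const int *funcs, int n, bool exit_on_func1)
{
	struct bdflush_model m = { 0 };

	assert(funcs != NULL || n == 0);

	for (int i = 0; i < n; i++)
		if (model_bdflush(&m, funcs[i], exit_on_func1))
			break;

	return m.msg_count;
}
```

Fed the hypothetical sequence 0, 1, 1, 1, ... the model issues two warnings
with the exit in place and five without it, which matches the counts reported
above. Other call sequences would fit the "warns twice" observation equally
well.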
* [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-28 23:42 ` Michael Schmitz
@ 2021-06-29 20:28 ` Eric W. Biederman
  2021-06-29 21:45 ` Michael Schmitz
  ` (3 more replies)
  0 siblings, 4 replies; 119+ messages in thread
From: Eric W. Biederman @ 2021-06-29 20:28 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: Geert Uytterhoeven, Al Viro, Linus Torvalds, linux-arch, Jens Axboe,
      Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
      Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
      Tejun Heo, Kees Cook, linux-api

The bdflush system call has been deprecated for a very long time.
Recently Michael Schmitz tested[1] and found that the last known
caller of the bdflush system call is unaffected by its removal.

Since the code is not needed, delete it.

[1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

I think we have consensus that bdflush can be removed. Can folks please
verify I have removed it correctly?

Michael, could you give me a Tested-by on this patch?

 arch/alpha/kernel/syscalls/syscall.tbl        |  2 +-
 arch/arm/tools/syscall.tbl                    |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +-
 arch/ia64/kernel/syscalls/syscall.tbl         |  2 +-
 arch/m68k/kernel/syscalls/syscall.tbl         |  2 +-
 arch/microblaze/kernel/syscalls/syscall.tbl   |  2 +-
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  2 +-
 arch/parisc/kernel/syscalls/syscall.tbl       |  2 +-
 arch/powerpc/kernel/syscalls/syscall.tbl      |  2 +-
 arch/s390/kernel/syscalls/syscall.tbl         |  2 +-
 arch/sh/kernel/syscalls/syscall.tbl           |  2 +-
 arch/sparc/kernel/syscalls/syscall.tbl        |  2 +-
 arch/x86/entry/syscalls/syscall_32.tbl        |  2 +-
 arch/xtensa/kernel/syscalls/syscall.tbl       |  2 +-
 fs/buffer.c                                   | 27 -------------------
 include/linux/syscalls.h                      |  1 -
 include/uapi/linux/capability.h               |  1 -
 kernel/sys_ni.c                               |  1 -
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  2 +-
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  2 +-
 20 files changed, 16 insertions(+), 46 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..85d2bcd9cf36 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -230,7 +230,7 @@
 259 common osf_swapctl sys_ni_syscall
 260 common osf_memcntl sys_ni_syscall
 261 common osf_fdatasync sys_ni_syscall
-300 common bdflush sys_bdflush
+300 common bdflush sys_ni_syscall
 301 common sethae sys_sethae
 302 common mount sys_mount
 303 common old_adjtimex sys_old_adjtimex
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 28e03b5fec00..241988512648 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -147,7 +147,7 @@
 131 common quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 common personality sys_personality
 # 137 was sys_afs_syscall
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 5dab69d2c22b..a35cd6c4909c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -279,7 +279,7 @@ __SYSCALL(__NR_getpgid, sys_getpgid)
 #define __NR_fchdir 133
 __SYSCALL(__NR_fchdir, sys_fchdir)
 #define __NR_bdflush 134
-__SYSCALL(__NR_bdflush, sys_bdflush)
+__SYSCALL(__NR_bdflush, sys_ni_syscall)
 #define __NR_sysfs 135
 __SYSCALL(__NR_sysfs, sys_sysfs)
 #define __NR_personality 136
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index bb11fe4c875a..7de53a9a2972 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -123,7 +123,7 @@
 # 1135 was get_kernel_syms
 # 1136 was query_module
 113 common quotactl sys_quotactl
-114 common bdflush sys_bdflush
+114 common bdflush sys_ni_syscall
 115 common sysfs sys_sysfs
 116 common personality sys_personality
 117 common afs_syscall sys_ni_syscall
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 79c2d24c89dd..be5abd9c8c07 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131 common quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 common personality sys_personality
 # 137 was afs_syscall
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b11395a20c20..555fd987f4ab 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131 common quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 common personality sys_personality
 137 common afs_syscall sys_ni_syscall
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index d560c467a8c6..2c6b10db3bd5 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -145,7 +145,7 @@
 131 o32 quotactl sys_quotactl
 132 o32 getpgid sys_getpgid
 133 o32 fchdir sys_fchdir
-134 o32 bdflush sys_bdflush
+134 o32 bdflush sys_ni_syscall
 135 o32 sysfs sys_sysfs
 136 o32 personality sys_personality sys_32_personality
 137 o32 afs_syscall sys_ni_syscall
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index aabc37f8cae3..51c156cb00f1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -147,7 +147,7 @@
 131 common quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 32 personality parisc_personality
 136 64 personality sys_personality
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 8f052ff4058c..2518e4e6dccf 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
 131 nospu quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 32 personality sys_personality ppc64_personality
 136 64 personality ppc64_personality
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 0690263df1dd..ffcf03714f12 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
 131 common quotactl sys_quotactl sys_quotactl
 132 common getpgid sys_getpgid sys_getpgid
 133 common fchdir sys_fchdir sys_fchdir
-134 common bdflush sys_bdflush sys_bdflush
+134 common bdflush sys_ni_syscall sys_ni_syscall
 135 common sysfs sys_sysfs sys_sysfs
 136 common personality sys_s390_personality sys_s390_personality
 137 common afs_syscall - -
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 0b91499ebdcf..6e7305066a70 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
 131 common quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 common personality sys_personality
 # 137 was afs_syscall
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e34cc30ef22c..bf330dda7c61 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -270,7 +270,7 @@
 222 common delete_module sys_delete_module
 223 common get_kernel_syms sys_ni_syscall
 224 common getpgid sys_getpgid
-225 common bdflush sys_bdflush
+225 common bdflush sys_ni_syscall
 226 common sysfs sys_sysfs
 227 common afs_syscall sys_nis_syscall
 228 common setfsuid sys_setfsuid16
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4bbc267fb36b..a21a72763d58 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -145,7 +145,7 @@
 131 i386 quotactl sys_quotactl
 132 i386 getpgid sys_getpgid
 133 i386 fchdir sys_fchdir
-134 i386 bdflush sys_bdflush
+134 i386 bdflush sys_ni_syscall
 135 i386 sysfs sys_sysfs
 136 i386 personality sys_personality
 137 i386 afs_syscall
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index fd2f30227d96..db4e3d09b249 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -223,7 +223,7 @@
 # 205 was old nfsservctl
 205 common nfsservctl sys_ni_syscall
 206 common _sysctl sys_ni_syscall
-207 common bdflush sys_bdflush
+207 common bdflush sys_ni_syscall
 208 common uname sys_newuname
 209 common sysinfo sys_sysinfo
 210 common init_module sys_init_module
diff --git a/fs/buffer.c b/fs/buffer.c
index ea48c01fb76b..04ddff76c860 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3292,33 +3292,6 @@ int try_to_free_buffers(struct page *page)
 }
 EXPORT_SYMBOL(try_to_free_buffers);
 
-/*
- * There are no bdflush tunables left. But distributions are
- * still running obsolete flush daemons, so we terminate them here.
- *
- * Use of bdflush() is deprecated and will be removed in a future kernel.
- * The `flush-X' kernel threads fully replace bdflush daemons and this call.
- */
-SYSCALL_DEFINE2(bdflush, int, func, long, data)
-{
-	static int msg_count;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	if (msg_count < 5) {
-		msg_count++;
-		printk(KERN_INFO
-			"warning: process `%s' used the obsolete bdflush"
-			" system call\n", current->comm);
-		printk(KERN_INFO "Fix your initscripts?\n");
-	}
-
-	if (func == 1)
-		do_exit(0);
-	return 0;
-}
-
 /*
  * Buffer-head allocation
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..1bd6e05ea116 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1157,7 +1157,6 @@ asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
 asmlinkage long sys_vfork(void);
 asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
 asmlinkage long sys_send(int, void __user *, size_t, unsigned);
-asmlinkage long sys_bdflush(int func, long data);
 asmlinkage long sys_oldumount(char __user *name);
 asmlinkage long sys_uselib(const char __user *library);
 asmlinkage long sys_sysfs(int option,
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 2ddb4226cd23..463d1ba2232a 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -243,7 +243,6 @@ struct vfs_ns_cap_data {
 /* Allow examination and configuration of disk quotas */
 /* Allow setting the domainname */
 /* Allow setting the hostname */
-/* Allow calling bdflush() */
 /* Allow mount() and umount(), setting up new smb connection */
 /* Allow some autofs root ioctls */
 /* Allow nfsservctl */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..adf4d66ffae2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -414,7 +414,6 @@ COND_SYSCALL(epoll_wait);
 COND_SYSCALL(recv);
 COND_SYSCALL_COMPAT(recv);
 COND_SYSCALL(send);
-COND_SYSCALL(bdflush);
 COND_SYSCALL(uselib);
 
 /* optional: time32 */
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 2e68fbb57cc6..ab72dec9dadb 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
 131 nospu quotactl sys_quotactl
 132 common getpgid sys_getpgid
 133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
 135 common sysfs sys_sysfs
 136 32 personality sys_personality ppc64_personality
 136 64 personality ppc64_personality
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 7e4a2aba366d..f2eba775e676 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
 131 common quotactl sys_quotactl sys_quotactl
 132 common getpgid sys_getpgid sys_getpgid
 133 common fchdir sys_fchdir sys_fchdir
-134 common bdflush sys_bdflush sys_bdflush
+134 common bdflush - -
 135 common sysfs sys_sysfs sys_sysfs
 136 common personality sys_s390_personality sys_s390_personality
 137 common afs_syscall - -
-- 
2.20.1

^ permalink raw reply related [flat|nested] 119+ messages in thread
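Most of the table changes in the patch above keep the bdflush slot occupied
and point it at sys_ni_syscall, which returns -ENOSYS. A rough user-space
model of that dispatch follows. It is a sketch under stated assumptions: the
table, the model_* names and the MODEL_NR_* constants are invented for the
illustration (133 and 134 are borrowed from the m68k/arm tables), and the
real tables are generated from the .tbl files rather than written like this.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long, long);

/* Stand-in for the kernel's sys_ni_syscall(): always "not implemented". */
static long model_ni_syscall(long a, long b)
{
	(void)a;
	(void)b;
	return -ENOSYS;
}

/* Stand-in for a call that is still implemented (hypothetical, always succeeds). */
static long model_fchdir(long a, long b)
{
	(void)a;
	(void)b;
	return 0;
}

#define MODEL_NR_FCHDIR		133
#define MODEL_NR_BDFLUSH	134
#define MODEL_NR_SYSCALLS	136

/* Slots not listed here stay NULL and are treated as unimplemented. */
static const syscall_fn model_table[MODEL_NR_SYSCALLS] = {
	[MODEL_NR_FCHDIR]  = model_fchdir,
	[MODEL_NR_BDFLUSH] = model_ni_syscall,	/* slot kept, handler gone */
};

static long model_dispatch(long nr, long a, long b)
{
	syscall_fn fn = model_ni_syscall;	/* default for bad or empty slots */

	if (nr >= 0 && nr < MODEL_NR_SYSCALLS && model_table[nr] != NULL)
		fn = model_table[nr];

	assert(fn != NULL);
	return fn(a, b);
}
```

In this model a caller of slot 134 sees -ENOSYS whatever arguments it
passes, while neighbouring slots are untouched, so the syscall number stays
reserved and cannot be reused by accident.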
* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28 ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
@ 2021-06-29 21:45 ` Michael Schmitz
  2021-06-30 8:24 ` Geert Uytterhoeven
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-29 21:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Geert Uytterhoeven, Al Viro, Linus Torvalds, linux-arch, Jens Axboe,
      Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
      Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
      Tejun Heo, Kees Cook, linux-api

On 30/06/21 8:28 am, Eric W. Biederman wrote:
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed, delete it.
>
> [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Tested-by: Michael Schmitz <schmitzmic@gmail.com>

> ---
>
> I think we have consensus that bdflush can be removed. Can folks please
> verify I have removed it correctly?
>
> Michael, could you give me a Tested-by on this patch?

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28 ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  2021-06-29 21:45 ` Michael Schmitz
@ 2021-06-30 8:24 ` Geert Uytterhoeven
  2021-06-30 8:37 ` Arnd Bergmann
  2021-06-30 12:30 ` Cyril Hrubis
  3 siblings, 0 replies; 119+ messages in thread
From: Geert Uytterhoeven @ 2021-06-30 8:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Al Viro, Linus Torvalds, linux-arch, Jens Axboe,
      Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson,
      Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann,
      Tejun Heo, Kees Cook, Linux API

On Tue, Jun 29, 2021 at 10:28 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed, delete it.
>
> [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

> arch/m68k/kernel/syscalls/syscall.tbl | 2 +-

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
    -- Linus Torvalds

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28 ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  2021-06-29 21:45 ` Michael Schmitz
  2021-06-30 8:24 ` Geert Uytterhoeven
@ 2021-06-30 8:37 ` Arnd Bergmann
  2021-06-30 12:30 ` Cyril Hrubis
  3 siblings, 0 replies; 119+ messages in thread
From: Arnd Bergmann @ 2021-06-30 8:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Geert Uytterhoeven, Al Viro, Linus Torvalds,
      linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
      Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k,
      Tejun Heo, Kees Cook, Linux API

On Tue, Jun 29, 2021 at 10:28 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
>
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed, delete it.
>
> [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> I think we have consensus that bdflush can be removed. Can folks please
> verify I have removed it correctly?

Reviewed-by: Arnd Bergmann <arnd@arndb.de>

We are traditionally somewhat inconsistent about whether to leave the
__NR_bdflush macro present in asm/unistd.h or to remove it when the
syscall is gone. Leaving it in place as you do is probably better here.

Arnd

^ permalink raw reply [flat|nested] 119+ messages in thread
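Arnd's point about leaving __NR_bdflush defined can be seen from the
user-space side: old source that names the number still builds, and only
finds out at run time that the call is gone. A hedged sketch, assuming a
Linux libc that provides syscall() and <sys/syscall.h>; probe_result,
classify() and probe_bdflush() are hypothetical names. On architectures
whose headers never carried __NR_bdflush (x86-64, for instance) it is the
#else branch that gets compiled.

```c
#define _GNU_SOURCE		/* for syscall() */
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>	/* pulls in the __NR_* numbers for this arch */
#include <unistd.h>

enum probe_result {
	PROBE_NO_NUMBER,	/* headers define no __NR_bdflush at all */
	PROBE_NOT_IMPLEMENTED,	/* number known, kernel answers ENOSYS */
	PROBE_OTHER,		/* success, or another error such as EPERM */
};

/* Map a syscall() return value and the errno it left behind to a result. */
static enum probe_result classify(long ret, int err)
{
	assert(err >= 0);	/* errno values are never negative */

	if (ret == -1 && err == ENOSYS)
		return PROBE_NOT_IMPLEMENTED;
	return PROBE_OTHER;
}

static enum probe_result probe_bdflush(void)
{
#ifdef __NR_bdflush
	/* func 0, not 1: the handler removed above only exited the caller for func == 1. */
	long ret = syscall(__NR_bdflush, 0, 0L);

	return classify(ret, errno);
#else
	return PROBE_NO_NUMBER;
#endif
}
```

Had the macro been dropped from the header as well, the same source would
take the #else path or, without the #ifdef, stop compiling. Keeping the
macro leaves that decision to a run-time ENOSYS check.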
* Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call
  2021-06-29 20:28 ` [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call Eric W. Biederman
  ` (2 preceding siblings ...)
  2021-06-30 8:37 ` Arnd Bergmann
@ 2021-06-30 12:30 ` Cyril Hrubis
  3 siblings, 0 replies; 119+ messages in thread
From: Cyril Hrubis @ 2021-06-30 12:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael Schmitz, Geert Uytterhoeven, Al Viro, Linus Torvalds,
      linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List,
      Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k,
      Arnd Bergmann, Tejun Heo, Kees Cook, linux-api

Hi!
I sent a similar patch [1] a while ago when I removed bdflush tests from
LTP.

[1] https://lore.kernel.org/lkml/20190528101012.11402-1-chrubis@suse.cz/

Acked-by: Cyril Hrubis <chrubis@suse.cz>

-- 
Cyril Hrubis
chrubis@suse.cz

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH 0/9] Refactoring exit 2021-06-24 22:45 ` [PATCH 0/9] Refactoring exit Al Viro 2021-06-27 22:13 ` Al Viro @ 2021-06-28 19:02 ` Eric W. Biederman 1 sibling, 0 replies; 119+ messages in thread From: Eric W. Biederman @ 2021-06-28 19:02 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Al Viro <viro@zeniv.linux.org.uk> writes: > On Thu, Jun 24, 2021 at 01:57:35PM -0500, Eric W. Biederman wrote: > >> So far the code has been lightly tested, and the descriptions of some >> of the patches are a bit light, but I think this shows the direction >> I am aiming to travel for sorting out exit(2) and exit_group(2). > > FWIW, here's the current picture for do_exit(), aside of exit(2) and do_exit_group(): > > 1) stuff that is clearly oops-like - > alpha:die_if_kernel() alpha:do_entUna() alpha:do_page_fault() arm:oops_end() > arm:__do_kernel_fault() arm64:die() arm64:die_kernel_fault() csky:alignment() > csky:die() csky:no_context() h8300:die() h8300:do_page_fault() hexagon:die() > ia64:die() i64:ia64_do_page_fault() m68k:die_if_kernel() m68k:send_fault_sig() > microblaze:die() mips:die() nds32:handle_fpu_exception() nds32:die() > nds32:unhandled_interruption() nds32:unhandled_exceptions() nds32:do_revinsn() > nds32:do_page_fault() nios:die() openrisc:die() openrisc:do_page_fault() > parisc:die_if_kernel() ppc:oops_end() riscv:die() riscv:die_kernel_fault() > s390:die() s390:do_no_context() s390:do_low_address() sh:die() > sparc32:die_if_kernel() sparc32:do_sparc_fault() sparc64:die_if_kernel() > x86:rewind_stack_do_exit() xtensa:die() xtensa:bad_page_fault() > We really do not want ptrace anywhere near any of those and we do not want > any of that to return; this shit would better be handled right there and > there - no "post a fatal signal" would 
do. Thanks, that makes a good start for digging into these. I think the distinction I would make is: - If the kernel is broken use do_task_dead. - Otherwise clean up the semantics by using start_group_exit, start_task_exit or by just cleaning up the code. Looking at the reboot case it looks like the code should have become do_group_exit in 2.5. I have a suspicion we have a bunch of similar cases that want to terminate the entire process, but we simply never updated them to deal with multi-threaded processes. I suspect in the reboot case a panic if machine_halt or machine_power_off fails is more likely the correct handling. But we do have funny semantics sometimes. I will see what I can do to expand my patchset to handle all of these various callers of do_exit. Eric ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 18:59 ` Al Viro 2021-06-21 19:22 ` Linus Torvalds @ 2021-06-21 19:24 ` Al Viro 2021-06-21 23:24 ` Michael Schmitz 1 sibling, 1 reply; 119+ messages in thread From: Al Viro @ 2021-06-21 19:24 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Michael Schmitz, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook On Mon, Jun 21, 2021 at 06:59:01PM +0000, Al Viro wrote: > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote: > > On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote: > > > > > And I think our horrible "kernel threads return to user space when > > > done" is absolutely horrifically nasty. Maybe of the clever sort, but > > > mostly of the historical horror sort. > > > > How would you prefer to handle that, then? Separate magical path from > > kernel_execve() to switch to userland? We used to have something of > > that sort, and that had been a real horror... > > > > As it is, it's "kernel thread is spawned at the point similar to > > ret_from_fork(), runs the payload (which almost never returns) and > > then proceeds out to userland, same way fork(2) would've done." > > That way kernel_execve() doesn't have to do anything magical. > > > > Al, digging through the old notes and current call graph... > > There's a large mess around do_exit() - we have a bunch of > callers all over arch/*; if nothing else, I very much doubt that really > want to let tracer play with a thread in the middle of die_if_kernel() > or similar. > > We sure as hell do not want to arrange for anything on the kernel > stack in such situations, no matter what's done in exit(2)... FWIW, on alpha it's die_if_kernel(), do_entUna() and do_page_fault(), all in not-from-userland cases. 
On m68k - die_if_kernel(), do_page_fault() (both for non-from-userland cases) and something really odd - fpsp040_die(). Exception handling for floating point stuff on 68040? Looks like it has an open-coded copy_to_user()/copy_from_user(), with faults doing hard do_exit(SIGSEGV) instead of raising a signal and trying to do something sane... I really don't want to try and figure out how painful would it be to teach that code how to deal with faults - _testing_ anything in that area sure as hell will be. IIRC, details of recovery from FPU exceptions on 68040 in the manual left impression of a minefield... ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-21 19:24 ` Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Al Viro @ 2021-06-21 23:24 ` Michael Schmitz 0 siblings, 0 replies; 119+ messages in thread From: Michael Schmitz @ 2021-06-21 23:24 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: Eric W. Biederman, linux-arch, Jens Axboe, Oleg Nesterov, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, Geert Uytterhoeven, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Hi Al, On 22/06/21 7:24 am, Al Viro wrote: > >> There's a large mess around do_exit() - we have a bunch of >> callers all over arch/*; if nothing else, I very much doubt that really >> want to let tracer play with a thread in the middle of die_if_kernel() >> or similar. >> >> We sure as hell do not want to arrange for anything on the kernel >> stack in such situations, no matter what's done in exit(2)... > FWIW, on alpha it's die_if_kernel(), do_entUna() and do_page_fault(), > all in not-from-userland cases. On m68k - die_if_kernel(), do_page_fault() > (both for non-from-userland cases) and something really odd - fpsp040_die(). > Exception handling for floating point stuff on 68040? Looks like it has Exception handling for emulated floating point instructions, really - exceptions happening when executing FPU instructions on hardware will do the normal exception processing. > an open-coded copy_to_user()/copy_from_user(), with faults doing hard > do_exit(SIGSEGV) instead of raising a signal and trying to do something > sane... Yes, that's what it does. Not pretty ... though all that using m68k copy_to_user()/copy_from_user() would change is returning how many bytes could not be copied. In contrast to the ifpsp060 code, we could not pass on that return status to callers of copyin/copyout in fpsp040, so I don't see what sane thing could be done if a fault happens. 
(I'd expect the MMU would have raised a bus error and resolved the problem by a page fault if possible, before we ever get to this point?) > I really don't want to try and figure out how painful would it be to > teach that code how to deal with faults - _testing_ anything in that > area sure as hell will be. IIRC, details of recovery from FPU exceptions > on 68040 in the manual left impression of a minefield... This is only about faults when moving data from/to user space. FPU exceptions are handled elsewhere in the code. So we at least don't have to deal with that particular minefield. Teaching the fpsp040 code to deal with access faults looks horrible indeed... let's not go there. Cheers, Michael ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-15 19:30 ` Eric W. Biederman ` (2 preceding siblings ...) 2021-06-15 21:58 ` Linus Torvalds @ 2021-06-16 7:38 ` Geert Uytterhoeven 2021-06-16 19:40 ` Michael Schmitz 3 siblings, 1 reply; 119+ messages in thread From: Geert Uytterhoeven @ 2021-06-16 7:38 UTC (permalink / raw) To: Eric W. Biederman Cc: Michael Schmitz, Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Hi Eric, On Tue, Jun 15, 2021 at 9:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > Do you happen to know if there is userspace that will run > in qemu-system-m68k that can be used for testing? There's a link to an image in Laurent's patch series "[PATCH 0/2] m68k: Add Virtual M68k Machine" https://lore.kernel.org/linux-m68k/20210323221430.3735147-1-laurent@vivier.eu/ Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads 2021-06-16 7:38 ` Geert Uytterhoeven @ 2021-06-16 19:40 ` Michael Schmitz 0 siblings, 0 replies; 119+ messages in thread From: Michael Schmitz @ 2021-06-16 19:40 UTC (permalink / raw) To: Geert Uytterhoeven, Eric W. Biederman Cc: Linus Torvalds, linux-arch, Jens Axboe, Oleg Nesterov, Al Viro, Linux Kernel Mailing List, Richard Henderson, Ivan Kokshaysky, Matt Turner, alpha, linux-m68k, Arnd Bergmann, Ley Foon Tan, Tejun Heo, Kees Cook Hi Geert, On 16/06/21 7:38 pm, Geert Uytterhoeven wrote: > Hi Eric, > > On Tue, Jun 15, 2021 at 9:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> Do you happen to know if there is userspace that will run >> in qemu-system-m68k that can be used for testing? > There's a link to an image in Laurent's patch series "[PATCH 0/2] > m68k: Add Virtual M68k Machine" > https://lore.kernel.org/linux-m68k/20210323221430.3735147-1-laurent@vivier.eu/ Thanks, I'll try that one. I'll try and implement a few of the solutions Eric came up with for alpha, unless someone beats me to it (Andreas?). Cheers, Michael > > Gr{oetje,eeting}s, > > Geert > ^ permalink raw reply [flat|nested] 119+ messages in thread
* [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-10 20:57 Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads Eric W. Biederman 2021-06-10 22:04 ` Linus Torvalds @ 2021-06-12 23:38 ` Michael Schmitz 2021-06-13 19:59 ` Linus Torvalds 2021-06-14 7:13 ` Michael Schmitz 1 sibling, 2 replies; 119+ messages in thread
From: Michael Schmitz @ 2021-06-12 23:38 UTC (permalink / raw)
To: geert, linux-arch, linux-m68k; +Cc: ebiederm, torvalds, schwab, Michael Schmitz

do_exit() calls ptrace_stop() which may require access to all saved
registers. We only save those registers not preserved by C code
currently.

Provide a special syscall entry for exit and exit_group syscalls
similar to that used by clone and clone3, which have the same
requirements.

No fix to io_uring appears to be needed, because m68k copy_thread
treats kernel threads the same as e.g. alpha does, and copies only
a subset of registers in that case.

CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
---
 arch/m68k/kernel/entry.S              | 14 ++++++++++++++
 arch/m68k/kernel/process.c            | 16 ++++++++++++++++
 arch/m68k/kernel/syscalls/syscall.tbl |  4 ++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/m68k/kernel/entry.S b/arch/m68k/kernel/entry.S
index 9dd76fb..1e067e6 100644
--- a/arch/m68k/kernel/entry.S
+++ b/arch/m68k/kernel/entry.S
@@ -76,6 +76,20 @@ ENTRY(__sys_clone3)
 	lea	%sp@(28),%sp
 	rts
 
+ENTRY(__sys_exit)
+	SAVE_SWITCH_STACK
+	pea	%sp@(SWITCH_STACK_SIZE)
+	jbsr	m68k_exit
+	lea	%sp@(28),%sp
+	rts
+
+ENTRY(__sys_exit_group)
+	SAVE_SWITCH_STACK
+	pea	%sp@(SWITCH_STACK_SIZE)
+	jbsr	m68k_exit_group
+	lea	%sp@(28),%sp
+	rts
+
 ENTRY(sys_sigreturn)
 	SAVE_SWITCH_STACK
 	movel	%sp,%sp@-		| switch_stack pointer
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index da83cc8..df4e5f1 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -138,6 +138,22 @@ asmlinkage int m68k_clone3(struct pt_regs *regs)
 	return sys_clone3((struct clone_args __user *)regs->d1, regs->d2);
 }
 
+/*
+ * Because extra registers are saved on the stack after the sys_exit()
+ * arguments, this C wrapper extracts them from pt_regs * and then calls the
+ * generic sys_exit() implementation.
+ */
+asmlinkage int m68k_exit(struct pt_regs *regs)
+{
+	return sys_exit(regs->d1);
+}
+
+/* Same for sys_exit_group ... */
+asmlinkage int m68k_exit_group(struct pt_regs *regs)
+{
+	return sys_exit_group(regs->d1);
+}
+
 int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
 		struct task_struct *p, unsigned long tls)
 {
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 0dd019d..3d5b6fbc 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
 # The <abi> is always "common" for this file
 #
 0	common	restart_syscall	sys_restart_syscall
-1	common	exit	sys_exit
+1	common	exit	__sys_exit
 2	common	fork	__sys_fork
 3	common	read	sys_read
 4	common	write	sys_write
@@ -254,7 +254,7 @@
 244	common	io_submit	sys_io_submit
 245	common	io_cancel	sys_io_cancel
 246	common	fadvise64	sys_fadvise64
-247	common	exit_group	sys_exit_group
+247	common	exit_group	__sys_exit_group
 248	common	lookup_dcookie	sys_lookup_dcookie
 249	common	epoll_create	sys_epoll_create
 250	common	epoll_ctl	sys_epoll_ctl
--
2.7.4

^ permalink raw reply related [flat|nested] 119+ messages in thread
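The constant 28 in the two new entry stubs is easier to check with the stack layout written out. The following is only a small C model of that arithmetic, not kernel code. It assumes (this is not stated in the patch itself) that SAVE_SWITCH_STACK pushes six 32-bit registers (d6, d7, a3-a6), that SWITCH_STACK_SIZE also counts the return address pushed by the jbsr that reached the stub, and that pt_regs sits directly above that return address. The names switch_stack_model, pt_regs_addr and bytes_to_pop_before_rts are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout of the callee-saved frame, lowest address first.
 * This is a model for illustration, not the kernel's struct switch_stack.
 */
struct switch_stack_model {
	uint32_t d6, d7, a3, a4, a5, a6;	/* pushed by SAVE_SWITCH_STACK */
	uint32_t retpc;				/* pushed by the jbsr into the stub */
};

enum {
	/* Bytes pushed by SAVE_SWITCH_STACK alone: six registers. */
	SAVED_REG_BYTES = offsetof(struct switch_stack_model, retpc),
	/* Model of SWITCH_STACK_SIZE: saved registers plus return address. */
	SWITCH_STACK_SIZE_MODEL = sizeof(struct switch_stack_model),
	/* The pointer argument pushed by pea. */
	PEA_ARG_BYTES = sizeof(uint32_t)
};

/* What "pea %sp@(SWITCH_STACK_SIZE)" computes: the address of pt_regs. */
uint32_t pt_regs_addr(uint32_t sp_after_save)
{
	return sp_after_save + SWITCH_STACK_SIZE_MODEL;
}

/*
 * What "lea %sp@(28),%sp" has to discard so that rts finds retpc on top:
 * the pea argument plus the six saved registers.
 */
uint32_t bytes_to_pop_before_rts(void)
{
	return PEA_ARG_BYTES + SAVED_REG_BYTES;
}
```

Under these assumptions both numbers come out as 28, but for different reasons: the pea offset skips the saved registers and the return address, while the lea drops the pea argument and the saved registers and leaves the return address for rts.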
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz @ 2021-06-13 19:59 ` Linus Torvalds 2021-06-13 20:07 ` Michael Schmitz 2021-06-14 7:13 ` Michael Schmitz 1 sibling, 1 reply; 119+ messages in thread From: Linus Torvalds @ 2021-06-13 19:59 UTC (permalink / raw) To: Michael Schmitz Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman, Andreas Schwab On Sat, Jun 12, 2021 at 4:38 PM Michael Schmitz <schmitzmic@gmail.com> wrote: > > do_exit() calls prace_stop() which may require access to all saved > registers. We only save those registers not preserved by C code > currently. > > Provide a special syscall entry for exit and exit_group syscalls > similar to that used by clone and clone3, which have the same > requirements. ACK, this looks correct to me. It might be a good idea to generate a test-case for this - some "ptrace child, catch exit of it, show registers" kind of thing - just to show what the effects of the bug was (and to show it's fixed). But maybe it's not worth the effort. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-13 19:59 ` Linus Torvalds @ 2021-06-13 20:07 ` Michael Schmitz 2021-06-13 20:26 ` Linus Torvalds 0 siblings, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-13 20:07 UTC (permalink / raw) To: Linus Torvalds Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman, Andreas Schwab Linus, On 14/06/21 7:59 am, Linus Torvalds wrote: > On Sat, Jun 12, 2021 at 4:38 PM Michael Schmitz <schmitzmic@gmail.com> wrote: >> do_exit() calls prace_stop() which may require access to all saved >> registers. We only save those registers not preserved by C code >> currently. >> >> Provide a special syscall entry for exit and exit_group syscalls >> similar to that used by clone and clone3, which have the same >> requirements. > ACK, this looks correct to me. > > It might be a good idea to generate a test-case for this - some > "ptrace child, catch exit of it, show registers" kind of thing - just > to show what the effects of the bug was (and to show it's fixed). But > maybe it's not worth the effort. I'd love that, too. My test rig doesn't allow dumping of registers by strace, but someone else may have that capacity. Cheers, Michael > > Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-13 20:07 ` Michael Schmitz @ 2021-06-13 20:26 ` Linus Torvalds 2021-06-13 20:33 ` Linus Torvalds 2021-06-13 20:47 ` Linus Torvalds 0 siblings, 2 replies; 119+ messages in thread
From: Linus Torvalds @ 2021-06-13 20:26 UTC (permalink / raw)
To: Michael Schmitz
Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman, Andreas Schwab

On Sun, Jun 13, 2021 at 1:07 PM Michael Schmitz <schmitzmic@gmail.com> wrote:
>
> I'd love that, too. My test rig doesn't allow dumping of registers by
> strace, but someone else may have that capacity.

I think doing it manually with gdb should be fairly straightforward.
Something like

    gdb /bin/true

and then in gdb you just do

    b main
    run

and then

    catch syscall group:process
    c

and it should stop at the exit_group or exit system call. At that
point you can just do

    info registers

and see if they match what user space *should* be. They'll probably be
complete garbage without the fix.

I do not have an alpha or m68k machine to test (and not the
energy/inclination to set up some virtual environment in qemu either).
But it should be easy if you already have that environment.

                Linus

^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-13 20:26 ` Linus Torvalds @ 2021-06-13 20:33 ` Linus Torvalds 2021-06-13 20:47 ` Linus Torvalds 1 sibling, 0 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-13 20:33 UTC (permalink / raw) To: Michael Schmitz Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman, Andreas Schwab On Sun, Jun 13, 2021 at 1:26 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > and then in gdb you just do > > b main > run Btw, this extra stage is unnecessary, but if I just do that "catch syscall group:process" before the process has even started, gdb gets confused at the start. You could skip this and just do "catch syscall exit_group" and then "run". I used that "group:process" just to catch both the legacy "exit" and the new "exit_group", but then it catches fork/execve too, and I think that's what confuses gdb when it happens as you start the process. Just to clarify why I did that odd thing. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-13 20:26 ` Linus Torvalds 2021-06-13 20:33 ` Linus Torvalds @ 2021-06-13 20:47 ` Linus Torvalds 1 sibling, 0 replies; 119+ messages in thread From: Linus Torvalds @ 2021-06-13 20:47 UTC (permalink / raw) To: Michael Schmitz Cc: Geert Uytterhoeven, linux-arch, linux-m68k, Eric W. Biederman, Andreas Schwab On Sun, Jun 13, 2021 at 1:26 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > They'll probably be complete garbage without the fix. Actually, never mind. My trivial gdb script is garbage. I think that even with the fix, it will be fine. Because this test will just use the regular system call entry tracing point - which gets the thing right. It's only PTRACE_EVENT_EXIT reporting that gets it wrong, not the generic system call tracing case. I'm not sure if/how you can get gdb to catch that PTRACE_EVENT_EXIT case. Sorry for my inane noise. Linus ^ permalink raw reply [flat|nested] 119+ messages in thread
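The case described above, a tracer that stops at PTRACE_EVENT_EXIT rather than at syscall entry, can be exercised without gdb. What follows is a minimal C sketch of the "ptrace child, catch exit of it, show registers" idea, not a tested reproducer for alpha or m68k. The helper name regs_at_exit_event is made up for illustration. It reads the registers with PTRACE_GETREGSET and NT_PRSTATUS, which is an assumption chosen for portability; nobody in the thread used that request, and whether it is available on a given architecture would need checking (PTRACE_GETREGS may be the alternative there).

```c
#define _GNU_SOURCE
#include <elf.h>		/* NT_PRSTATUS */
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>		/* struct iovec */
#include <sys/wait.h>
#include <unistd.h>

/* Kill the tracee and wait until it is really gone. */
static void kill_and_reap(pid_t pid)
{
	int status;

	kill(pid, SIGKILL);
	while (waitpid(pid, &status, 0) > 0 &&
	       !WIFEXITED(status) && !WIFSIGNALED(status))
		kill(pid, SIGKILL);
}

/*
 * Hypothetical helper: fork a child that calls _exit(code), stop it at
 * PTRACE_EVENT_EXIT and copy its general-purpose register set into buf.
 * Returns the number of bytes the kernel wrote, or -1 on any failure.
 * *exit_msg receives the PTRACE_GETEVENTMSG value (wait-status format).
 */
long regs_at_exit_event(int code, void *buf, size_t buflen,
			unsigned long *exit_msg)
{
	struct iovec iov = { .iov_base = buf, .iov_len = buflen };
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: become a tracee, then stop so options can be set. */
		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
			_exit(127);
		raise(SIGSTOP);
		_exit(code);
	}

	/* If the child is not stopped here it has exited and is reaped. */
	if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
		return -1;
	if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
		   (void *)(long)PTRACE_O_TRACEEXIT) < 0)
		goto fail;
	if (ptrace(PTRACE_CONT, pid, NULL, NULL) < 0)
		goto fail;

	/* The next stop should be the exit event raised from do_exit(). */
	if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
		return -1;
	if ((status >> 8) != (SIGTRAP | (PTRACE_EVENT_EXIT << 8)))
		goto fail;

	if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, exit_msg) < 0)
		goto fail;
	/* This is the register read that the thread is worried about. */
	if (ptrace(PTRACE_GETREGSET, pid, (void *)(long)NT_PRSTATUS, &iov) < 0)
		goto fail;

	/* Let the child finish exiting and reap it. */
	ptrace(PTRACE_CONT, pid, NULL, NULL);
	waitpid(pid, &status, 0);
	return (long)iov.iov_len;

fail:
	kill_and_reap(pid);
	return -1;
}
```

A caller would pass a generously sized buffer (say 128 unsigned longs) and print its contents. On an affected kernel the expectation from the thread is that the registers not saved on syscall entry would hold stale kernel stack contents at this stop, so comparing them against known values loaded by the child before _exit() would be the actual check; the sketch leaves that architecture-specific part out.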
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-12 23:38 ` [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry Michael Schmitz 2021-06-13 19:59 ` Linus Torvalds @ 2021-06-14 7:13 ` Michael Schmitz 2021-06-14 7:40 ` Andreas Schwab 1 sibling, 1 reply; 119+ messages in thread From: Michael Schmitz @ 2021-06-14 7:13 UTC (permalink / raw) To: geert, linux-m68k; +Cc: schwab, Kars de Jong Hi Geert, do we need to add .globl __sys_exit, __sys_exit_group (and perhaps __sys_clone3) at the start of entry.S? We have that for __sys_fork, __sys_clone and __sys_vfork. Cheers, Michael Am 13.06.2021 um 11:38 schrieb Michael Schmitz: > do_exit() calls prace_stop() which may require access to all saved > registers. We only save those registers not preserved by C code > currently. > > Provide a special syscall entry for exit and exit_group syscalls > similar to that used by clone and clone3, which have the same > requirements. > > No fix to io_uring appears to be needed, because m68k copy_thread > treats kernel threads the same as e.g. alpha does, and copies only > a subset of registers in that case. > > CC: Eric W. 
Biederman <ebiederm@xmission.com> > CC: Linus Torvalds <torvalds@linux-foundation.org> > CC: Andreas Schwab <schwab@linux-m68k.org> > Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> > --- > arch/m68k/kernel/entry.S | 14 ++++++++++++++ > arch/m68k/kernel/process.c | 16 ++++++++++++++++ > arch/m68k/kernel/syscalls/syscall.tbl | 4 ++-- > 3 files changed, 32 insertions(+), 2 deletions(-) > > diff --git a/arch/m68k/kernel/entry.S b/arch/m68k/kernel/entry.S > index 9dd76fb..1e067e6 100644 > --- a/arch/m68k/kernel/entry.S > +++ b/arch/m68k/kernel/entry.S > @@ -76,6 +76,20 @@ ENTRY(__sys_clone3) > lea %sp@(28),%sp > rts > > +ENTRY(__sys_exit) > + SAVE_SWITCH_STACK > + pea %sp@(SWITCH_STACK_SIZE) > + jbsr m68k_exit > + lea %sp@(28),%sp > + rts > + > +ENTRY(__sys_exit_group) > + SAVE_SWITCH_STACK > + pea %sp@(SWITCH_STACK_SIZE) > + jbsr m68k_exit_group > + lea %sp@(28),%sp > + rts > + > ENTRY(sys_sigreturn) > SAVE_SWITCH_STACK > movel %sp,%sp@- | switch_stack pointer > diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c > index da83cc8..df4e5f1 100644 > --- a/arch/m68k/kernel/process.c > +++ b/arch/m68k/kernel/process.c > @@ -138,6 +138,22 @@ asmlinkage int m68k_clone3(struct pt_regs *regs) > return sys_clone3((struct clone_args __user *)regs->d1, regs->d2); > } > > +/* > + * Because extra registers are saved on the stack after the sys_exit() > + * arguments, this C wrapper extracts them from pt_regs * and then calls the > + * generic sys_exit() implementation. > + */ > +asmlinkage int m68k_exit(struct pt_regs *regs) > +{ > + return sys_exit(regs->d1); > +} > + > +/* Same for sys_exit_group ... 
*/ > +asmlinkage int m68k_exit_group(struct pt_regs *regs) > +{ > + return sys_exit_group(regs->d1); > +} > + > int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg, > struct task_struct *p, unsigned long tls) > { > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl > index 0dd019d..3d5b6fbc 100644 > --- a/arch/m68k/kernel/syscalls/syscall.tbl > +++ b/arch/m68k/kernel/syscalls/syscall.tbl > @@ -8,7 +8,7 @@ > # The <abi> is always "common" for this file > # > 0 common restart_syscall sys_restart_syscall > -1 common exit sys_exit > +1 common exit __sys_exit > 2 common fork __sys_fork > 3 common read sys_read > 4 common write sys_write > @@ -254,7 +254,7 @@ > 244 common io_submit sys_io_submit > 245 common io_cancel sys_io_cancel > 246 common fadvise64 sys_fadvise64 > -247 common exit_group sys_exit_group > +247 common exit_group __sys_exit_group > 248 common lookup_dcookie sys_lookup_dcookie > 249 common epoll_create sys_epoll_create > 250 common epoll_ctl sys_epoll_ctl > ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-14 7:13 ` Michael Schmitz @ 2021-06-14 7:40 ` Andreas Schwab 2021-06-14 8:19 ` Michael Schmitz 0 siblings, 1 reply; 119+ messages in thread From: Andreas Schwab @ 2021-06-14 7:40 UTC (permalink / raw) To: Michael Schmitz; +Cc: geert, linux-m68k, Kars de Jong On Jun 14 2021, Michael Schmitz wrote: > do we need to add > > .globl __sys_exit, __sys_exit_group > > (and perhaps __sys_clone3) at the start of entry.S? ENTRY takes care of that. You wouldn't be able to link without that anyway. > We have that for __sys_fork, __sys_clone and __sys_vfork. Not really needed. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v1] m68k: save extra registers on sys_exit and sys_exit_group syscall entry 2021-06-14 7:40 ` Andreas Schwab @ 2021-06-14 8:19 ` Michael Schmitz 0 siblings, 0 replies; 119+ messages in thread From: Michael Schmitz @ 2021-06-14 8:19 UTC (permalink / raw) To: Andreas Schwab; +Cc: geert, linux-m68k, Kars de Jong Hi Andreas, Am 14.06.2021 um 19:40 schrieb Andreas Schwab: > On Jun 14 2021, Michael Schmitz wrote: > >> do we need to add >> >> .globl __sys_exit, __sys_exit_group >> >> (and perhaps __sys_clone3) at the start of entry.S? > > ENTRY takes care of that. You wouldn't be able to link without that > anyway. Thanks - I guessed being able to link (and run) the code should be good enough, but I don't try a lot of tool chains... Cheers, Michael > >> We have that for __sys_fork, __sys_clone and __sys_vfork. > > Not really needed. > > Andreas. > ^ permalink raw reply [flat|nested] 119+ messages in thread