All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] signal/exit/ptrace changes for v5.17
@ 2022-01-14 23:59 Eric W. Biederman
  2022-01-17  4:08 ` Linus Torvalds
  2022-01-17  5:18 ` pr-tracker-bot
  0 siblings, 2 replies; 8+ messages in thread
From: Eric W. Biederman @ 2022-01-14 23:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Alexey Gladkov, Al Viro, Kees Cook, Oleg Nesterov


Linus,

Please pull the signal-for-v5.17 branch from the git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git signal-for-v5.17

  HEAD: a403df29789ba38796edb97dad9bfb47836b68c0 ptrace/m68k: Stop open coding ptrace_report_syscall


This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs
found along the way.

The end-game which I have not yet reached yet is for fatal signals that
generate coredumps to be short-circuit deliverable from complete_signal,
for force_siginfo_to_task not to require changing userspace configured
signal delivery state, and for the ptrace stops to always happen in
locations where we can guarantee on all architectures that the all of
the registers are saved and available on the stack.

Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.

A bunch of small bug fixes are included, as most of the issues reported
were small enough that they would not affect bisection so I simply added
the fixes and did not fold the fixes into the changes they were fixing.

There was a bug that broke coredumps piped to systemd-coredump.  I
dropped the change that caused that bug and replaced it entirely with
something much more restrained.  Unfortunately that required some
rebasing.

I am currently investigating to figure out if wake_up_interruptible
(instead of simply wake_up) ever makes sense outside of the signal code.

Some successes after this set of changes: There are few enough calls to
do_exit to audit in a reasonable amount of time.  The lifetime of struct
kthread now matches the lifetime of struct task, and the pointer to
struct kthread is no longer stored in set_child_tid.  The flag
SIGNAL_GROUP_COREDUMP is removed.  The field group_exit_task is removed.
Issues where task->exit_code was examined with signal->group_exit_code
should been examined were fixed.

There are several loosely related changes included because I am cleaning
up and if I don't include them they will probably get lost.

The original postings of these changes can be found at:
   https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org
   https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org
   https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org

I trimmed back the last set of changes to only the obviously correct
once.  Simply because there was less time for review than I had hoped.

I am sending this later than I would like as there was an issue that
was discovered just before the merge window and there is a big storm
coming through where I live.  Linus I hope your travel is going well.

Eric W. Biederman (37):
      exit/s390: Remove dead reference to do_exit from copy_thread
      exit: Add and use make_task_dead.
      exit: Move oops specific logic from do_exit into make_task_dead
      exit: Stop poorly open coding do_task_dead in make_task_dead
      exit: Stop exporting do_exit
      exit: Implement kthread_exit
      exit: Rename module_put_and_exit to module_put_and_kthread_exit
      exit: Rename complete_and_exit to kthread_complete_and_exit
      kthread: Ensure struct kthread is present for all kthreads
      exit/kthread: Move the exit code for kernel threads into struct kthread
      exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit
      objtool: Add a missing comma to avoid string concatenation
      fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
      fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct
      kthread: Warn about failed allocations for the init kthread
      kthread: Never put_user the set_child_tid address
      kthread: Generalize pf_io_worker so it can point to struct kthread
      exit/xtensa: In arch/xtensa/entry.S:Linvalid_mask call make_task_dead
      exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
      exit: Move force_uaccess back into do_exit
      signal: Have the oom killer detect coredumps using signal->core_state
      signal: Have prepare_signal detect coredumps using signal->core_state
      signal: Make coredump handling explicit in complete_signal
      signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
      signal: Remove SIGNAL_GROUP_COREDUMP
      coredump: Stop setting signal->group_exit_task
      signal: Rename group_exit_task group_exec_task
      signal: Remove the helper signal_group_exit
      exit: Remove profile_task_exit & profile_munmap
      exit: Remove profile_handoff_task
      exit: Coredumps reach do_group_exit
      exit: Fix the exit_code for wait_task_zombie
      exit: Use the correct exit_code in /proc/<pid>/stat
      taskstats: Cleanup the use of task->exit_code
      ptrace: Remove second setting of PT_SEIZED in ptrace_attach
      ptrace: Remove unused regs argument from ptrace_report_syscall
      ptrace/m68k: Stop open coding ptrace_report_syscall

Nathan Chancellor (3):
      hexagon: Fix function name in die()
      h8300: Fix build errors from do_exit() to make_task_dead() transition
      csky: Fix function name in csky_alignment() and die()

Randy Dunlap (1):
      signal: clean up kernel-doc comments

 arch/alpha/kernel/traps.c                    |  6 +-
 arch/alpha/mm/fault.c                        |  2 +-
 arch/arm/kernel/traps.c                      |  2 +-
 arch/arm/mm/fault.c                          |  2 +-
 arch/arm64/kernel/traps.c                    |  2 +-
 arch/arm64/mm/fault.c                        |  2 +-
 arch/csky/abiv1/alignment.c                  |  2 +-
 arch/csky/kernel/traps.c                     |  2 +-
 arch/csky/mm/fault.c                         |  2 +-
 arch/h8300/kernel/traps.c                    |  3 +-
 arch/h8300/mm/fault.c                        |  2 +-
 arch/hexagon/kernel/traps.c                  |  2 +-
 arch/ia64/kernel/mca_drv.c                   |  2 +-
 arch/ia64/kernel/traps.c                     |  2 +-
 arch/ia64/mm/fault.c                         |  2 +-
 arch/m68k/kernel/ptrace.c                    | 12 +---
 arch/m68k/kernel/traps.c                     |  2 +-
 arch/m68k/mm/fault.c                         |  2 +-
 arch/microblaze/kernel/exceptions.c          |  4 +-
 arch/mips/kernel/traps.c                     |  2 +-
 arch/nds32/kernel/fpu.c                      |  2 +-
 arch/nds32/kernel/traps.c                    |  8 +--
 arch/nios2/kernel/traps.c                    |  4 +-
 arch/openrisc/kernel/traps.c                 |  2 +-
 arch/parisc/kernel/traps.c                   |  2 +-
 arch/powerpc/kernel/traps.c                  |  8 +--
 arch/riscv/kernel/traps.c                    |  2 +-
 arch/riscv/mm/fault.c                        |  2 +-
 arch/s390/kernel/dumpstack.c                 |  2 +-
 arch/s390/kernel/nmi.c                       |  2 +-
 arch/s390/kernel/process.c                   |  1 -
 arch/sh/kernel/traps.c                       |  2 +-
 arch/sparc/kernel/traps_32.c                 |  4 +-
 arch/sparc/kernel/traps_64.c                 |  4 +-
 arch/x86/entry/entry_32.S                    |  6 +-
 arch/x86/entry/entry_64.S                    |  6 +-
 arch/x86/kernel/dumpstack.c                  |  4 +-
 arch/xtensa/kernel/entry.S                   |  2 +-
 arch/xtensa/kernel/traps.c                   |  2 +-
 crypto/algboss.c                             |  4 +-
 drivers/net/wireless/rsi/rsi_91x_coex.c      |  2 +-
 drivers/net/wireless/rsi/rsi_91x_main.c      |  2 +-
 drivers/net/wireless/rsi/rsi_91x_sdio_ops.c  |  2 +-
 drivers/net/wireless/rsi/rsi_91x_usb_ops.c   |  2 +-
 drivers/pnp/pnpbios/core.c                   |  6 +-
 drivers/staging/rts5208/rtsx.c               | 16 ++---
 drivers/usb/atm/usbatm.c                     |  2 +-
 drivers/usb/gadget/function/f_mass_storage.c |  2 +-
 fs/cifs/connect.c                            |  2 +-
 fs/coredump.c                                | 14 ++--
 fs/exec.c                                    | 12 ++--
 fs/io-wq.c                                   |  6 +-
 fs/io-wq.h                                   |  2 +-
 fs/jffs2/background.c                        |  2 +-
 fs/nfs/callback.c                            |  4 +-
 fs/nfs/nfs4state.c                           |  2 +-
 fs/nfsd/nfssvc.c                             |  2 +-
 fs/proc/array.c                              |  6 +-
 include/linux/kernel.h                       |  1 -
 include/linux/kthread.h                      |  4 +-
 include/linux/module.h                       |  6 +-
 include/linux/profile.h                      | 45 -------------
 include/linux/sched.h                        |  4 +-
 include/linux/sched/signal.h                 | 18 +-----
 include/linux/sched/task.h                   |  1 +
 include/linux/tracehook.h                    |  7 +-
 kernel/exit.c                                | 97 +++++++++++++++-------------
 kernel/fork.c                                | 20 +++---
 kernel/futex/core.c                          |  2 +-
 kernel/kexec_core.c                          |  2 +-
 kernel/kthread.c                             | 88 +++++++++++++++++--------
 kernel/module.c                              |  6 +-
 kernel/profile.c                             | 73 ---------------------
 kernel/ptrace.c                              |  2 -
 kernel/sched/core.c                          | 16 ++---
 kernel/signal.c                              | 19 +++---
 kernel/tsacct.c                              |  7 +-
 lib/kunit/try-catch.c                        |  4 +-
 mm/mmap.c                                    |  1 -
 mm/oom_kill.c                                |  2 +-
 net/bluetooth/bnep/core.c                    |  2 +-
 net/bluetooth/cmtp/core.c                    |  2 +-
 net/bluetooth/hidp/core.c                    |  2 +-
 tools/objtool/check.c                        |  8 ++-
 84 files changed, 274 insertions(+), 377 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-14 23:59 [GIT PULL] signal/exit/ptrace changes for v5.17 Eric W. Biederman
@ 2022-01-17  4:08 ` Linus Torvalds
  2022-01-17  4:15   ` Linus Torvalds
  2022-01-17 15:31   ` Eric W. Biederman
  2022-01-17  5:18 ` pr-tracker-bot
  1 sibling, 2 replies; 8+ messages in thread
From: Linus Torvalds @ 2022-01-17  4:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

On Sat, Jan 15, 2022 at 2:00 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I am currently investigating to figure out if wake_up_interruptible
> (instead of simply wake_up) ever makes sense outside of the signal code.

It may not be a huge deal, but it *absolutely* makes sense.

Any subsystem that uses "wait_event_interruptible()" (or variations of
that) should by default only use "wake_up_interruptible()" to wake the
wait queue.

The reason? Being (interruptibly) on one wait-queue does *NOT* make it
impossible that the very same process isn't waiting non-interruptibly
on another one.

It's not hugely common, but the Linux kernel wait-queues have very
much been designed for the whole "one process can be on multiple wait
queues for different reasons at the same time" model. That is *very*
core.

People sometimes think that is just a "poll/select()" thing, but
that's not at all true. It's quite valid to do things like

        add_wait_queue(..)
        for (.. some loop ..) {
                set_current_state(TASK_INTERRUPTIBLE);
                ... do various things, checking state etc ..
               schedule();
        }
        set_current_state(TASK_RUNNABLE);
        remove_wait_queue();

and part of that "do various things" is obviously checking for signals
and other "exit this wait loop", but other things can be things like
taking a page fault because you copied data from user space etc.

And that in turn may end up having a nested wait on another waitqueue
(for the IO), and the outer wait queue should basically strive to not
wake up unrelated "deeper" sleeps, because that is pointless and just
causes extra wakeups.

So the page fault event will cause a nested TASK_UNINTERRUPTIBLE
sleep, and when that IO has completed, it goes into TASK_RUNNABLE, so
the outer (interruptible) loop above will have a "dummy schedule()"
and loop around again to be put back into TASK_INTERRUPTIBLE sleep
next time around.

But note how it would be pointless to use a "wake_up()" on this outer
wait queue - it would wake up the deeper IO sleep too, and that would
just see "oh, the IO I'm waiting for hasn't completed, so I'll just
have to go to sleep again".

Would it still _work_? Yes. Is the above _common_? No. But it's a
really fundamnetal pattern in the kernel, and it's fundamentally how
wait queues work, and how they should work, and an interruptible sleep
should generally be seen as pairing with an interruptible wake,
because that's just how things are.

Why would you want to change something basic like that? Yes, using
"wake_up()" instead of "wake_up_interruptible()" would still result in
a working kernel, but it's just _pointless_.

             Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-17  4:08 ` Linus Torvalds
@ 2022-01-17  4:15   ` Linus Torvalds
  2022-01-17 15:45     ` Eric W. Biederman
  2022-01-17 15:31   ` Eric W. Biederman
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2022-01-17  4:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

On Mon, Jan 17, 2022 at 6:08 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> People sometimes think that is just a "poll/select()" thing, but
> that's not at all true. It's quite valid to do things like
>
>         add_wait_queue(..)
>         for (.. some loop ..) {
>                 set_current_state(TASK_INTERRUPTIBLE);
>                 ... do various things, checking state etc ..
>                schedule();
>         }
>         set_current_state(TASK_RUNNABLE);
>         remove_wait_queue();

Of course, in most modern cases, the above sequence is actually
encoded as a "wait_event_interruptible()", because we don't generally
want to open-code the whole thing.

But the actual end result still ends up being exactly the same, it's
just syntactically denser and more legible version of the above thing,
and you can still have the "event" you wait on have nested waiting
situations.

The nested waiting is by no means common. The only _common_situation
where you're on multiple wait queues tends to be select/poll kind of
things, when they aren't really nested as much as iterated over, but
conceptually the nested case is still quite important, and it allows
you to do things that a traditional "wait_on()" interface with just
one single wait-queue just doesn't allow for.

                Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-14 23:59 [GIT PULL] signal/exit/ptrace changes for v5.17 Eric W. Biederman
  2022-01-17  4:08 ` Linus Torvalds
@ 2022-01-17  5:18 ` pr-tracker-bot
  1 sibling, 0 replies; 8+ messages in thread
From: pr-tracker-bot @ 2022-01-17  5:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, linux-kernel, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

The pull request you sent on Fri, 14 Jan 2022 17:59:37 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git signal-for-v5.17

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/35ce8ae9ae2e471f92759f9d6880eab42cc1c3b6

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-17  4:08 ` Linus Torvalds
  2022-01-17  4:15   ` Linus Torvalds
@ 2022-01-17 15:31   ` Eric W. Biederman
  2022-01-17 15:44     ` Linus Torvalds
  1 sibling, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 2022-01-17 15:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Jan 15, 2022 at 2:00 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I am currently investigating to figure out if wake_up_interruptible
>> (instead of simply wake_up) ever makes sense outside of the signal code.
>
> It may not be a huge deal, but it *absolutely* makes sense.
>
> Any subsystem that uses "wait_event_interruptible()" (or variations of
> that) should by default only use "wake_up_interruptible()" to wake the
> wait queue.
>
> The reason? Being (interruptibly) on one wait-queue does *NOT* make it
> impossible that the very same process isn't waiting non-interruptibly
> on another one.
>
> It's not hugely common, but the Linux kernel wait-queues have very
> much been designed for the whole "one process can be on multiple wait
> queues for different reasons at the same time" model. That is *very*
> core.
>
> People sometimes think that is just a "poll/select()" thing, but
> that's not at all true. It's quite valid to do things like
>
>         add_wait_queue(..)
>         for (.. some loop ..) {
>                 set_current_state(TASK_INTERRUPTIBLE);
>                 ... do various things, checking state etc ..
>                schedule();
>         }
>         set_current_state(TASK_RUNNABLE);
>         remove_wait_queue();
>
> and part of that "do various things" is obviously checking for signals
> and other "exit this wait loop", but other things can be things like
> taking a page fault because you copied data from user space etc.
>
> And that in turn may end up having a nested wait on another waitqueue
> (for the IO), and the outer wait queue should basically strive to not
> wake up unrelated "deeper" sleeps, because that is pointless and just
> causes extra wakeups.
>
> So the page fault event will cause a nested TASK_UNINTERRUPTIBLE
> sleep, and when that IO has completed, it goes into TASK_RUNNABLE, so
> the outer (interruptible) loop above will have a "dummy schedule()"
> and loop around again to be put back into TASK_INTERRUPTIBLE sleep
> next time around.
>
> But note how it would be pointless to use a "wake_up()" on this outer
> wait queue - it would wake up the deeper IO sleep too, and that would
> just see "oh, the IO I'm waiting for hasn't completed, so I'll just
> have to go to sleep again".
>
> Would it still _work_? Yes. Is the above _common_? No. But it's a
> really fundamnetal pattern in the kernel, and it's fundamentally how
> wait queues work, and how they should work, and an interruptible sleep
> should generally be seen as pairing with an interruptible wake,
> because that's just how things are.
>
> Why would you want to change something basic like that? Yes, using
> "wake_up()" instead of "wake_up_interruptible()" would still result in
> a working kernel, but it's just _pointless_.

Thank you for the detailed reply.  I am going to have to spend a little
bit digesting it.

They why is that I am digging into how to reliably test for and get
just the wakeups needed in the coredump code.

As a first approximation writing to files and causing dump_interrupted
to change for signal_pending to fatal_sending_pending worked like a
charm.  Unfortunately the pipe code is still performing interruptible
waits and io_uring causes truncated coredumps.

I would like to have a version of pipe_write that sleeps in
TASK_KILLABLE.  I want the I/O wake-ups and I want the SIGKILL wake ups
but I don't want any other wake-ups.  Unfortunately the I/O wake-ups in
the pipe code are sent with wake_up_interruptible.  So a task sleeping
in TASK_KILLABLE won't get them.

Which means that the obvious solution of changing
wait_event_interruptible to wake_event_killable breaks coredump support
(as I found out the hard way).

I understand things well enough that I can change the signal code and
not make the coredump code worse.  I am still trying to figure out what
a clean maintainable solution for coredumps writing to pipes is going to
be.  So I will dig in and read the code that has the nested wait queues
and see if I can understand that logic, and keep thinking about the
coredump support.

Eric

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-17 15:31   ` Eric W. Biederman
@ 2022-01-17 15:44     ` Linus Torvalds
  2022-01-17 16:31       ` Eric W. Biederman
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2022-01-17 15:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

On Mon, Jan 17, 2022 at 5:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I would like to have a version of pipe_write that sleeps in
> TASK_KILLABLE.

That would actually be horrible for another reason - now it would
count towards the load average. That's another difference between
interruptible waits and non-interruptible ones.

Admittedly it's an entirely arbitrary one, but it's part of the whole
semantic difference between TASK_INTERRUPTIBLE and
TASK_UNINTERRUPTIBLE.

You can play with TASK_NOLOAD of course, so it's something that can be
worked around, but it gets a bit ugly.

>  I want the I/O wake-ups and I want the SIGKILL wake ups
> but I don't want any other wake-ups.  Unfortunately the I/O wake-ups in
> the pipe code are sent with wake_up_interruptible.  So a task sleeping
> in TASK_KILLABLE won't get them.

Yeah. The code *could* use the non-interruptible 'wake_up()', and
everything should work - because waking things up too much doesn't
change semantics, it's just a slight pessimization. Plus the whole
"nested waitqueues" isn't actually any remotely normal case, so it
doesn't really matter for performance either.

But I really think it's wrong.

You're trying to work around a problem the wrong way around. If a task
is dead, and is dumping core, then signals just shouldn't matter in
the first place, and thus the whole "TASK_INTERRUPTIBLE vs
TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
is an issue means there's something wrong in signaling, not in the
pipe code.

So I really think that's where the fix should be - on the signal delivery side.

             Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-17  4:15   ` Linus Torvalds
@ 2022-01-17 15:45     ` Eric W. Biederman
  0 siblings, 0 replies; 8+ messages in thread
From: Eric W. Biederman @ 2022-01-17 15:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jan 17, 2022 at 6:08 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>>
>> People sometimes think that is just a "poll/select()" thing, but
>> that's not at all true. It's quite valid to do things like
>>
>>         add_wait_queue(..)
>>         for (.. some loop ..) {
>>                 set_current_state(TASK_INTERRUPTIBLE);
>>                 ... do various things, checking state etc ..
>>                schedule();
>>         }
>>         set_current_state(TASK_RUNNABLE);
>>         remove_wait_queue();
>
> Of course, in most modern cases, the above sequence is actually
> encoded as a "wait_event_interruptible()", because we don't generally
> want to open-code the whole thing.

Yes.

What I was looking at that inspired the question is that
"wake_up" ultimately expands to "try_to_wake_up(task, TASK_NORMAL, 0)".

Whereas "wake_up_interruptible" expands to
"try_to_wake_up(task, TASK_INTERRUPTIBLE, 0)".

With the practical challenge that if I want to change
wait_event_interruptible to wait_event_killable I need to change all of
the wakers.

> But the actual end result still ends up being exactly the same, it's
> just syntactically denser and more legible version of the above thing,
> and you can still have the "event" you wait on have nested waiting
> situations.
>
> The nested waiting is by no means common. The only _common_situation
> where you're on multiple wait queues tends to be select/poll kind of
> things, when they aren't really nested as much as iterated over, but
> conceptually the nested case is still quite important, and it allows
> you to do things that a traditional "wait_on()" interface with just
> one single wait-queue just doesn't allow for.

I think it may just be the part of the kernel where I usually work.
Changing wait_event_interruptible to wait_event_killable has always just
worked for me, but it doesn't in the pipe code.

It doesn't because of wake_up_interruptible.

I do know that short-term-disk-sleep aka task_uninterruptible is special
to performing things like disk I/O, and really short term things.

It might just be the names but I look at wake_up_interruptible and my
klaxon's go off in my head saying something doesn't make sense.  So I
will read up and look at those nested wait-queue scenarios and see if I
can find the piece of understanding I am missing.

Eric

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [GIT PULL] signal/exit/ptrace changes for v5.17
  2022-01-17 15:44     ` Linus Torvalds
@ 2022-01-17 16:31       ` Eric W. Biederman
  0 siblings, 0 replies; 8+ messages in thread
From: Eric W. Biederman @ 2022-01-17 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Alexey Gladkov, Al Viro, Kees Cook,
	Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jan 17, 2022 at 5:32 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I would like to have a version of pipe_write that sleeps in
>> TASK_KILLABLE.
>
> That would actually be horrible for another reason - now it would
> count towards the load average. That's another difference between
> interruptible waits and non-interruptible ones.
>
> Admittedly it's an entirely arbitrary one, but it's part of the whole
> semantic difference between TASK_INTERRUPTIBLE and
> TASK_UNINTERRUPTIBLE.
>
> You can play with TASK_NOLOAD of course, so it's something that can be
> worked around, but it gets a bit ugly.

Yes.  I don't want to make a change that changes the load average.

>>  I want the I/O wake-ups and I want the SIGKILL wake ups
>> but I don't want any other wake-ups.  Unfortunately the I/O wake-ups in
>> the pipe code are sent with wake_up_interruptible.  So a task sleeping
>> in TASK_KILLABLE won't get them.
>
> Yeah. The code *could* use the non-interruptible 'wake_up()', and
> everything should work - because waking things up too much doesn't
> change semantics, it's just a slight pessimization. Plus the whole
> "nested waitqueues" isn't actually any remotely normal case, so it
> doesn't really matter for performance either.
>
> But I really think it's wrong.
>
> You're trying to work around a problem the wrong way around. If a task
> is dead, and is dumping core, then signals just shouldn't matter in
> the first place, and thus the whole "TASK_INTERRUPTIBLE vs
> TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
> is an issue means there's something wrong in signaling, not in the
> pipe code.
>
> So I really think that's where the fix should be - on the signal delivery side.

The actual signaling is shutdown, (except for the special case of
SIGKILL being able to terminate the coredump).  It is io_uring and
anything else that is not a signal that causes signal_pending() to
return true.

I have not found any solution I am happy with yet, I am just
brainstorming.

Part of the problem is that I really don't want to perform process
shutdown and remove evidence of why the process crashed.  So maybe
shutting down io_uring is fine in that case but I don't like that either.

The more I look at all of the interesting corner cases the more I wonder
if the solution isn't to have the coredump code fork a kernel-only
userspace process (like the io_uring threads are kernel-only userspace
threads).  That would at least allow kernel functionality to work like
normal and greatly reduce the chance of weird feature interactions.

Hmm. A special kernel-only thread might even be enough as io_uring would
not be directing task_work at it.

You have been me some good information and I think I just need to sleep
on this problem a bit more to come up with a non-hacky solution.

Eric

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-01-17 16:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-14 23:59 [GIT PULL] signal/exit/ptrace changes for v5.17 Eric W. Biederman
2022-01-17  4:08 ` Linus Torvalds
2022-01-17  4:15   ` Linus Torvalds
2022-01-17 15:45     ` Eric W. Biederman
2022-01-17 15:31   ` Eric W. Biederman
2022-01-17 15:44     ` Linus Torvalds
2022-01-17 16:31       ` Eric W. Biederman
2022-01-17  5:18 ` pr-tracker-bot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.