linux-kernel.vger.kernel.org archive mirror
* Re: [Bug 200447] infinite loop in fork syscall
       [not found]                   ` <20180710134639.GA2453@redhat.com>
@ 2018-07-10 16:00                     ` Eric W. Biederman
  2018-07-11 12:08                       ` Oleg Nesterov
       [not found]                     ` <CA+55aFxcjSYAj-CZFEuDwiwZyMg+prV_jeP_Vuh06RJA0BboSw@mail.gmail.com>
  1 sibling, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-10 16:00 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/09, Linus Torvalds wrote:
>>
>> But the patch was written for testing and feedback more than anything
>> else. Comments?
>
> see my reply on bugzilla. Can't we add lkml?
Done.

> Perhaps we can do another change? Not sure it is actually better, but I think
> it is always good to discuss the alternatives.
>
> 1. Once again, we turn "int group" argument into thread/process/pgid enum, and
>    change kill_pgrp/tty_signal_session_leader to pass "group = pgid", imo this
>    makes sense in any case.

I agree.  There are a lot more multi-process signal cases than are
handled in Linus's patch.  I believe this will clean that up.

> 2. To simplify, lets suppose we add the new PF_INFORK flag. Yes, this is bad,
>    we can do better. I think we can simply add "struct hlist_head forking_threads"
>    into signal_struct, so complete_signal() can just do hlist_for_each_entry()
>    rather than for_each_thread() + PF_INFORK check. We don't even need a new
>    member in task_struct.

We still need the distinction between multi-process signals and single
process signals (which is the hard part).  For good performance of
signal delivery to multi-threaded tasks we still need a new member in
signal_struct.  Plus it is a bit more work to update the list or even
walk the list than a sequence counter.

So I think adding a sequence counter to let us know about multiprocess
signals is the local optimum.
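
To make the sequence counter idea concrete, here is a minimal userspace
sketch.  It is not the kernel implementation: struct fork_model, the
sig_scope enum and every model_* function are hypothetical names made up
for illustration.  The assumption is that only multi-process signals bump
the counter, and that fork restarts only when the counter moved while it
was running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state a forking process would track. */
struct fork_model {
	unsigned long multiprocess_seq;	/* bumped once per multi-process signal */
};

/* Who a signal is aimed at; the names are illustrative only. */
enum sig_scope { SCOPE_THREAD, SCOPE_PROCESS, SCOPE_MULTIPROCESS };

/* Sender side: only multi-process signals touch the counter. */
void model_send_signal(struct fork_model *m, enum sig_scope scope)
{
	if (scope == SCOPE_MULTIPROCESS)
		m->multiprocess_seq++;
}

/* Fork side: snapshot the counter before copying the process. */
unsigned long model_fork_begin(const struct fork_model *m)
{
	return m->multiprocess_seq;
}

/* Fork side: restart only if a multi-process signal arrived meanwhile. */
bool model_fork_must_restart(const struct fork_model *m, unsigned long snap)
{
	return m->multiprocess_seq != snap;
}

/* Small self-check that doubles as a usage example. */
void model_selfcheck(void)
{
	struct fork_model m = { 0 };
	unsigned long snap = model_fork_begin(&m);

	model_send_signal(&m, SCOPE_PROCESS);
	assert(!model_fork_must_restart(&m, snap));

	model_send_signal(&m, SCOPE_MULTIPROCESS);
	assert(model_fork_must_restart(&m, snap));
}
```

The point of the model is that a periodic timer aimed at a single process
never changes the counter, so it cannot keep fork restarting.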

If we ever need to remove the restart case entirely from fork and queue
all new multi-process signals for the newly created task.  Going through
the list of forking

>
> 3. copy_process() can simply block/unblock all signals (except KILL/STOP), see
>    the "patch" below.

All signals are effectively blocked for the duration of the fork for the
calling task.    Where we get into trouble and where we need a fix for
correctness is that another thread can dequeue the signal.   Blocking
signals of the forking task does not change that.

I think that reveals another bug in our current logic.  For blocked
multi-process signals we don't ensure they get delivered to both the
parent and the child if the signal logically comes in after the fork.

Multi-threaded b0rkenness and blocked signal b0rkenness, plus periodic
timers making fork take forever for no good reason.  This business of
ensuring a signal is logically delivered before or after a fork is
tricky.

Eric


> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9440d61..652ef09 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1591,6 +1591,23 @@ static __latent_entropy struct task_struct *copy_process(
>  	int retval;
>  	struct task_struct *p;
>  
> +	spin_lock_irq(&current->sighand->siglock);
> +	recalc_sigpending_tsk(current);
> +	if (signal_pending(current)) {
> +		retval = -ERESTARTNOINTR;
> +	} else {
> +		// Yes, yes, this is wrong, just to explain the idea.
> +		// We should not block SIGKILL, we need to clear PF_INFORK
> +		// and restore ->blocked in error paths, we need helper(s).
> +		retval = 0;
> +		current->flags |= PF_INFORK;
> +		current->saved_sigmask = current->blocked;
> +		sigfillset(&current->blocked);
> +	}
> +	spin_unlock_irq(&current->sighand->siglock);
> +	if (retval)
> +		return retval;
> +
>  	/*
>  	 * Don't allow sharing the root directory with processes in a different
>  	 * namespace
> @@ -1918,8 +1935,14 @@ static __latent_entropy struct task_struct *copy_process(
>  	 * A fatal signal pending means that current will exit, so the new
>  	 * thread can't slip out of an OOM kill (or normal SIGKILL).
>  	*/
> -	recalc_sigpending();
> -	if (signal_pending(current)) {
> +	// check signal_pending() before __set_task_blocked() which does
> +	// recalc_sigpending(). Again, this is just to explain the idea,
> +	// __set_task_blocked() is not exported, it is suboptimal, we can
> +	// do better.
> +	bool eintr = signal_pending(current);
> +	current->flags &= ~PF_INFORK;
> +	__set_task_blocked(current, &current->saved_sigmask);
> +	if (eintr) {
>  		retval = -ERESTARTNOINTR;
>  		goto bad_fork_cancel_cgroup;
>  	}
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8d8a940..3433e66 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -900,6 +900,14 @@ static void complete_signal(int sig, struct task_struct *p, int group)
>  	struct signal_struct *signal = p->signal;
>  	struct task_struct *t;
>  
> +	// kill_pgrp/etc.
> +	if (group == 2) {
> +		for_each_thread(p, t) {
> +			if (t->flags & PF_INFORK && !sigismember(&t->saved_sigmask, sig))
> +				signal_wake_up(t, 0);
> +		}
> +	}
> +
>  	/*
>  	 * Now find a thread we can wake up to take the signal off the queue.
>  	 *

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts
       [not found]                     ` <CA+55aFxcjSYAj-CZFEuDwiwZyMg+prV_jeP_Vuh06RJA0BboSw@mail.gmail.com>
@ 2018-07-11  2:41                       ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 01/11] pids: Initialize leader_pid in init_task Eric W. Biederman
                                           ` (11 more replies)
  0 siblings, 12 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang


The following patches should be close.  I took some patches I haven't
taken the time to merge yet that make PIDTYPE_TGID not a hack.

Updated the code that deals with signals to handle PIDTYPE_TGID.

Pushed the pid type down from the signal senders all of the way
down into __send_signal.  That work could probably use being
split into more than one patch for readability, but it seems
reasonable and less of a hack than the "bool group" we have
currently.
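
As a rough illustration of what a pid type carries that "bool group"
cannot, here is a small userspace sketch.  The model_* names and the two
enums are hypothetical and only loosely mirror the kernel's enum pid_type.
The assumption is that a thread-directed signal goes to a private queue,
everything else goes to the shared queue, and only pgrp and session
signals count as multi-process.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's pid types. */
enum model_pid_type { MODEL_PID, MODEL_TGID, MODEL_PGID, MODEL_SID };

/* Which pending queue a signal would land on in this model. */
enum model_queue { QUEUE_PRIVATE, QUEUE_SHARED };

/* All the old "bool group" could express: thread or not-thread. */
bool model_old_group_flag(enum model_pid_type type)
{
	return type != MODEL_PID;
}

/* Queue selection needs no more than the old flag did. */
enum model_queue model_pick_queue(enum model_pid_type type)
{
	return type == MODEL_PID ? QUEUE_PRIVATE : QUEUE_SHARED;
}

/* The new information: does the signal target more than one process? */
bool model_is_multiprocess(enum model_pid_type type)
{
	return type == MODEL_PGID || type == MODEL_SID;
}

/* Small self-check that doubles as a usage example. */
void model_selfcheck(void)
{
	/* kill(pid) and kill(-pgrp) look identical through the old flag... */
	assert(model_old_group_flag(MODEL_TGID) == model_old_group_flag(MODEL_PGID));
	/* ...but the pid type tells them apart. */
	assert(!model_is_multiprocess(MODEL_TGID));
	assert(model_is_multiprocess(MODEL_PGID));
}
```

With the type available at the bottom of the send path, fork only has to
care about the signals for which model_is_multiprocess() would be true.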

I think I have gotten all of the places we send signals to multiple
processes.  But I have yet to make an exhaustive examination.  I would
appreciate some review feedback before I burn a day doing that.

All in all this changes a little more than I might hope for but it seems
a nicely targeted cleanup that sorts out the fork issue.

Comments please.

I think I am 99% of the way to solving this cleanly but any feedback would
be much appreciated.

Thank you in advance.

Eric W. Biederman (11):
      pids: Initialize leader_pid in init_task
      pids: Move task_pid_type into sched/signal.h
      pids: Compute task_tgid using signal->leader_pid
      kvm: Don't open code task_pid in kvm_vcpu_ioctl
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct
      pid: Implement PIDTYPE_TGID
      signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
      signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
      tty_io: Use do_send_sig_info in __do_SACK  to forcibly kill tasks
      signal: Push pid type from signal senders down into __send_signal
      signal: Ignore all but multi-process signals that come in during fork.

 arch/ia64/kernel/asm-offsets.c       |  4 +--
 arch/ia64/kernel/fsys.S              | 12 +++----
 arch/s390/kernel/perf_cpum_sf.c      |  2 +-
 drivers/net/tun.c                    |  2 +-
 drivers/platform/x86/thinkpad_acpi.c |  1 +
 drivers/tty/sysrq.c                  |  2 +-
 drivers/tty/tty_io.c                 | 10 +++---
 fs/autofs/autofs_i.h                 |  1 +
 fs/exec.c                            |  1 +
 fs/fcntl.c                           | 38 ++++++++--------------
 fs/fuse/file.c                       |  1 +
 fs/locks.c                           |  2 +-
 fs/notify/dnotify/dnotify.c          |  3 +-
 fs/notify/fanotify/fanotify.c        |  1 +
 include/linux/init_task.h            |  9 ------
 include/linux/pid.h                  | 11 ++-----
 include/linux/sched.h                | 31 ++++--------------
 include/linux/sched/signal.h         | 39 +++++++++++++++++++++--
 include/linux/signal.h               |  6 ++--
 include/net/scm.h                    |  1 +
 include/trace/events/signal.h        | 12 +++----
 init/init_task.c                     | 12 ++++---
 kernel/events/core.c                 |  2 +-
 kernel/exit.c                        | 12 ++-----
 kernel/fork.c                        | 45 +++++++++++++++++++++-----
 kernel/pid.c                         | 42 ++++++++++++-------------
 kernel/signal.c                      | 61 ++++++++++++++++++++----------------
 kernel/time/itimer.c                 |  5 +--
 kernel/time/posix-cpu-timers.c       |  2 +-
 kernel/time/posix-timers.c           | 13 +++-----
 mm/oom_kill.c                        |  4 +--
 virt/kvm/kvm_main.c                  |  2 +-
 32 files changed, 205 insertions(+), 184 deletions(-)



* [RFC][PATCH 01/11] pids: Initialize leader_pid in init_task
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 02/11] pids: Move task_pid_type into sched/signal.h Eric W. Biederman
                                           ` (10 subsequent siblings)
  11 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This is cheap and no cost so we might as well.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 init/init_task.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/init/init_task.c b/init/init_task.c
index 74f60baa2799..7914ffb8dc73 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -33,6 +33,7 @@ static struct signal_struct init_signals = {
 	},
 #endif
 	INIT_CPU_TIMERS(init_signals)
+	.leader_pid = &init_struct_pid,
 	INIT_PREV_CPUTIME(init_signals)
 };
 
-- 
2.17.1



* [RFC][PATCH 02/11] pids: Move task_pid_type into sched/signal.h
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 01/11] pids: Initialize leader_pid in init_task Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 03/11] pids: Compute task_tgid using signal->leader_pid Eric W. Biederman
                                           ` (9 subsequent siblings)
  11 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

The function is general and inline so there is no need
to hide it inside of exit.c

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h | 8 ++++++++
 kernel/exit.c                | 8 --------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 113d1ad1ced7..d8ef0a3d2e7e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -556,6 +556,14 @@ extern bool current_is_single_threaded(void);
 typedef int (*proc_visitor)(struct task_struct *p, void *data);
 void walk_process_tree(struct task_struct *top, proc_visitor, void *);
 
+static inline
+struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
+{
+	if (type != PIDTYPE_PID)
+		task = task->group_leader;
+	return task->pids[type].pid;
+}
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/kernel/exit.c b/kernel/exit.c
index c3c7ac560114..16432428fc6c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1001,14 +1001,6 @@ struct wait_opts {
 	int			notask_error;
 };
 
-static inline
-struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
-{
-	if (type != PIDTYPE_PID)
-		task = task->group_leader;
-	return task->pids[type].pid;
-}
-
 static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
 {
 	return	wo->wo_type == PIDTYPE_MAX ||
-- 
2.17.1



* [RFC][PATCH 03/11] pids: Compute task_tgid using signal->leader_pid
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 01/11] pids: Initialize leader_pid in init_task Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 02/11] pids: Move task_pid_type into sched/signal.h Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 04/11] kvm: Don't open code task_pid in kvm_vcpu_ioctl Eric W. Biederman
                                           ` (8 subsequent siblings)
  11 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

The cost is the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.

__task_pid_nr_ns has been updated to take advantage of this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/kernel/asm-offsets.c       |  2 +-
 arch/ia64/kernel/fsys.S              |  8 ++++----
 drivers/platform/x86/thinkpad_acpi.c |  1 +
 fs/fuse/file.c                       |  1 +
 fs/notify/fanotify/fanotify.c        |  1 +
 include/linux/sched.h                |  5 -----
 include/linux/sched/signal.h         |  5 +++++
 include/net/scm.h                    |  1 +
 kernel/pid.c                         | 15 ++++++++-------
 9 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index f4db2168d1b8..f5433bb7f04a 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -50,7 +50,6 @@ void foo(void)
 
 	DEFINE(IA64_TASK_BLOCKED_OFFSET,offsetof (struct task_struct, blocked));
 	DEFINE(IA64_TASK_CLEAR_CHILD_TID_OFFSET,offsetof (struct task_struct, clear_child_tid));
-	DEFINE(IA64_TASK_GROUP_LEADER_OFFSET, offsetof (struct task_struct, group_leader));
 	DEFINE(IA64_TASK_TGIDLINK_OFFSET, offsetof (struct task_struct, pids[PIDTYPE_PID].pid));
 	DEFINE(IA64_PID_LEVEL_OFFSET, offsetof (struct pid, level));
 	DEFINE(IA64_PID_UPID_OFFSET, offsetof (struct pid, numbers[0]));
@@ -68,6 +67,7 @@ void foo(void)
 	DEFINE(IA64_SIGNAL_GROUP_STOP_COUNT_OFFSET,offsetof (struct signal_struct,
 							     group_stop_count));
 	DEFINE(IA64_SIGNAL_SHARED_PENDING_OFFSET,offsetof (struct signal_struct, shared_pending));
+	DEFINE(IA64_SIGNAL_LEADER_PID_OFFSET, offsetof (struct signal_struct, leader_pid));
 
 	BLANK();
 
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index fe742ffafc7a..eaf5a0d6f3e0 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -62,16 +62,16 @@ ENTRY(fsys_getpid)
 	.prologue
 	.altrp b6
 	.body
-	add r17=IA64_TASK_GROUP_LEADER_OFFSET,r16
+	add r17=IA64_TASK_SIGNAL_OFFSET,r16
 	;;
-	ld8 r17=[r17]				// r17 = current->group_leader
+	ld8 r17=[r17]				// r17 = current->signal
 	add r9=TI_FLAGS+IA64_TASK_SIZE,r16
 	;;
 	ld4 r9=[r9]
-	add r17=IA64_TASK_TGIDLINK_OFFSET,r17
+	add r17=IA64_SIGNAL_LEADER_PID_OFFSET,r17
 	;;
 	and r9=TIF_ALLWORK_MASK,r9
-	ld8 r17=[r17]				// r17 = current->group_leader->pids[PIDTYPE_PID].pid
+	ld8 r17=[r17]				// r17 = current->signal->leader_pid
 	;;
 	add r8=IA64_PID_LEVEL_OFFSET,r17
 	;;
diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c
index cae9b0595692..d556e95c532c 100644
--- a/drivers/platform/x86/thinkpad_acpi.c
+++ b/drivers/platform/x86/thinkpad_acpi.c
@@ -57,6 +57,7 @@
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/delay.h>
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a201fb0ac64f..b00a3f126a89 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -12,6 +12,7 @@
 #include <linux/slab.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/swap.h>
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f90842efea13..6e828cb82e5e 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -8,6 +8,7 @@
 #include <linux/mount.h>
 #include <linux/sched.h>
 #include <linux/sched/user.h>
+#include <linux/sched/signal.h>
 #include <linux/types.h>
 #include <linux/wait.h>
 #include <linux/audit.h>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..a461ff89a3af 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1202,11 +1202,6 @@ static inline struct pid *task_pid(struct task_struct *task)
 	return task->pids[PIDTYPE_PID].pid;
 }
 
-static inline struct pid *task_tgid(struct task_struct *task)
-{
-	return task->group_leader->pids[PIDTYPE_PID].pid;
-}
-
 /*
  * Without tasklist or RCU lock it is not safe to dereference
  * the result of task_pgrp/task_session even if task == current,
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d8ef0a3d2e7e..b95a272c1ab5 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -564,6 +564,11 @@ struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
 	return task->pids[type].pid;
 }
 
+static inline struct pid *task_tgid(struct task_struct *task)
+{
+	return task->signal->leader_pid;
+}
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/include/net/scm.h b/include/net/scm.h
index 903771c8d4e3..1ce365f4c256 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -8,6 +8,7 @@
 #include <linux/security.h>
 #include <linux/pid.h>
 #include <linux/nsproxy.h>
+#include <linux/sched/signal.h>
 
 /* Well, we should have at least one descriptor open
  * to accept passed FDs 8)
diff --git a/kernel/pid.c b/kernel/pid.c
index 157fe4b19971..d0de2b59f86f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -421,13 +421,14 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 	if (!ns)
 		ns = task_active_pid_ns(current);
 	if (likely(pid_alive(task))) {
-		if (type != PIDTYPE_PID) {
-			if (type == __PIDTYPE_TGID)
-				type = PIDTYPE_PID;
-
-			task = task->group_leader;
-		}
-		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
+		struct pid *pid;
+		if (type == PIDTYPE_PID)
+			pid = task_pid(task);
+		else if (type == __PIDTYPE_TGID)
+			pid = task_tgid(task);
+		else
+			pid = rcu_dereference(task->group_leader->pids[type].pid);
+		nr = pid_nr_ns(pid, ns);
 	}
 	rcu_read_unlock();
 
-- 
2.17.1
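
For readers following along, the difference between the two task_tgid
lookups can be modelled in userspace.  This is only a sketch: the model_*
structures are hypothetical, they borrow field names from the patch, and
the self-check does not reproduce de_thread's real ordering.  It only
shows that a pointer stored once in the shared structure does not depend
on which task group_leader happens to point at.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; nothing here is kernel code. */
struct model_pid { int nr; };

struct model_signal {
	struct model_pid *leader_pid;	/* stored once, shared by all threads */
};

struct model_task {
	struct model_pid *own_pid;		/* stands in for pids[PIDTYPE_PID].pid */
	struct model_task *group_leader;
	struct model_signal *signal;
};

/* Old lookup: hop through whichever task is currently the leader. */
struct model_pid *model_tgid_via_leader(const struct model_task *t)
{
	return t->group_leader->own_pid;
}

/* New lookup: read the pointer kept in the shared structure. */
struct model_pid *model_tgid_via_signal(const struct model_task *t)
{
	return t->signal->leader_pid;
}

/* Small self-check that doubles as a usage example. */
void model_selfcheck(void)
{
	struct model_pid lpid = { 100 }, tpid = { 101 };
	struct model_signal sig = { &lpid };
	struct model_task leader = { &lpid, NULL, &sig };
	struct model_task thread = { &tpid, &leader, &sig };

	leader.group_leader = &leader;
	assert(model_tgid_via_leader(&thread) == model_tgid_via_signal(&thread));

	/* Repoint group_leader: only the old lookup notices. */
	thread.group_leader = &thread;
	assert(model_tgid_via_signal(&thread) == &lpid);
	assert(model_tgid_via_leader(&thread) == &tpid);
}
```

Both lookups cost one pointer hop from the task followed by one load,
which is consistent with the "cost is the same" remark in the changelog.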



* [RFC][PATCH 04/11] kvm: Don't open code task_pid in kvm_vcpu_ioctl
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (2 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 03/11] pids: Compute task_tgid using signal->leader_pid Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
                                           ` (7 subsequent siblings)
  11 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..4c593acc4510 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2560,7 +2560,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		if (arg)
 			goto out;
 		oldpid = rcu_access_pointer(vcpu->pid);
-		if (unlikely(oldpid != current->pids[PIDTYPE_PID].pid)) {
+		if (unlikely(oldpid != task_pid(current))) {
 			/* The thread running this VCPU changed. */
 			struct pid *newpid;
 
-- 
2.17.1



* [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (3 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 04/11] kvm: Don't open code task_pid in kvm_vcpu_ioctl Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-17 11:59                           ` Oleg Nesterov
  2018-07-11  2:44                         ` [RFC][PATCH 06/11] pid: Implement PIDTYPE_TGID Eric W. Biederman
                                           ` (6 subsequent siblings)
  11 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

To access these fields the code always has to go to the group leader, so
going to signal_struct is no loss and is actually a fundamental simplification.

This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct cannot.
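
A userspace sketch of the layout change, for illustration only.  The
model_* types are hypothetical and merely shaped like the structures in
this patch.  The assumption is a per-task part that keeps one pid pointer
plus the list nodes, and a per-process part that keeps the pgrp and
session pointers once.

```c
#include <assert.h>
#include <stddef.h>

enum model_pid_type { MODEL_PID, MODEL_PGID, MODEL_SID, MODEL_MAX };

struct model_pid { int nr; };
struct model_hlist_node { struct model_hlist_node *next, **pprev; };

/* Old per-task layout: a list node and a pid pointer for every type. */
struct model_old_link { struct model_hlist_node node; struct model_pid *pid; };
struct model_old_task_pids { struct model_old_link pids[MODEL_MAX]; };

/* New per-task layout: one pid pointer, plus a list node for every type. */
struct model_new_task_pids {
	struct model_pid *thread_pid;
	struct model_hlist_node pid_links[MODEL_MAX];
};

/* New per-process part: the remaining pid pointers, stored once. */
struct model_signal { struct model_pid *pids[MODEL_MAX]; };

struct model_task {
	struct model_new_task_pids ids;
	struct model_signal *signal;
};

/* Lookup in the new layout: per-thread for PID, shared otherwise. */
struct model_pid *model_task_pid_type(const struct model_task *t,
				      enum model_pid_type type)
{
	return type == MODEL_PID ? t->ids.thread_pid : t->signal->pids[type];
}

/* Per-thread space saved by the model's new layout. */
size_t model_bytes_saved_per_thread(void)
{
	return sizeof(struct model_old_task_pids) -
	       sizeof(struct model_new_task_pids);
}

/* Small self-check that doubles as a usage example. */
void model_selfcheck(void)
{
	struct model_pid pa = { 10 }, pb = { 11 }, pg = { 10 }, pg2 = { 42 };
	struct model_signal sig = { .pids = { [MODEL_PGID] = &pg } };
	struct model_task a = { .ids = { .thread_pid = &pa }, .signal = &sig };
	struct model_task b = { .ids = { .thread_pid = &pb }, .signal = &sig };

	/* One store to the shared structure is seen by every thread. */
	sig.pids[MODEL_PGID] = &pg2;
	assert(model_task_pid_type(&a, MODEL_PGID) == &pg2);
	assert(model_task_pid_type(&b, MODEL_PGID) == &pg2);
	assert(model_task_pid_type(&a, MODEL_PID) != model_task_pid_type(&b, MODEL_PID));
}
```

In this model each thread drops the pointer fields for the pgrp and
session types, while the process as a whole gains one small array.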

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/kernel/asm-offsets.c |  2 +-
 arch/ia64/kernel/fsys.S        |  4 +--
 fs/autofs/autofs_i.h           |  1 +
 include/linux/init_task.h      |  9 -------
 include/linux/pid.h            |  8 +-----
 include/linux/sched.h          | 22 +++--------------
 include/linux/sched/signal.h   | 26 +++++++++++++++++---
 init/init_task.c               | 11 +++++----
 kernel/fork.c                  | 23 +++++++++++++----
 kernel/pid.c                   | 45 +++++++++++++++++-----------------
 10 files changed, 78 insertions(+), 73 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index f5433bb7f04a..c1f8a57855af 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -50,7 +50,7 @@ void foo(void)
 
 	DEFINE(IA64_TASK_BLOCKED_OFFSET,offsetof (struct task_struct, blocked));
 	DEFINE(IA64_TASK_CLEAR_CHILD_TID_OFFSET,offsetof (struct task_struct, clear_child_tid));
-	DEFINE(IA64_TASK_TGIDLINK_OFFSET, offsetof (struct task_struct, pids[PIDTYPE_PID].pid));
+	DEFINE(IA64_TASK_THREAD_PID_OFFSET,offsetof (struct task_struct, thread_pid));
 	DEFINE(IA64_PID_LEVEL_OFFSET, offsetof (struct pid, level));
 	DEFINE(IA64_PID_UPID_OFFSET, offsetof (struct pid, numbers[0]));
 	DEFINE(IA64_TASK_PENDING_OFFSET,offsetof (struct task_struct, pending));
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index eaf5a0d6f3e0..e85ebdac678b 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -96,11 +96,11 @@ ENTRY(fsys_set_tid_address)
 	.altrp b6
 	.body
 	add r9=TI_FLAGS+IA64_TASK_SIZE,r16
-	add r17=IA64_TASK_TGIDLINK_OFFSET,r16
+	add r17=IA64_TASK_THREAD_PID_OFFSET,r16
 	;;
 	ld4 r9=[r9]
 	tnat.z p6,p7=r32		// check argument register for being NaT
-	ld8 r17=[r17]				// r17 = current->pids[PIDTYPE_PID].pid
+	ld8 r17=[r17]				// r17 = current->thread_pid
 	;;
 	and r9=TIF_ALLWORK_MASK,r9
 	add r8=IA64_PID_LEVEL_OFFSET,r17
diff --git a/fs/autofs/autofs_i.h b/fs/autofs/autofs_i.h
index 9400a9f6318a..502812289850 100644
--- a/fs/autofs/autofs_i.h
+++ b/fs/autofs/autofs_i.h
@@ -18,6 +18,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/uaccess.h>
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a454b8aeb938..a7083a45a26c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -46,15 +46,6 @@ extern struct cred init_cred;
 #define INIT_CPU_TIMERS(s)
 #endif
 
-#define INIT_PID_LINK(type) 					\
-{								\
-	.node = {						\
-		.next = NULL,					\
-		.pprev = NULL,					\
-	},							\
-	.pid = &init_struct_pid,				\
-}
-
 #define INIT_TASK_COMM "swapper"
 
 /* Attach to the init_task data structure for proper alignment */
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7633d55d9a24..3d4c504dcc8c 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,12 +67,6 @@ struct pid
 
 extern struct pid init_struct_pid;
 
-struct pid_link
-{
-	struct hlist_node node;
-	struct pid *pid;
-};
-
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
@@ -177,7 +171,7 @@ pid_t pid_vnr(struct pid *pid);
 	do {								\
 		if ((pid) != NULL)					\
 			hlist_for_each_entry_rcu((task),		\
-				&(pid)->tasks[type], pids[type].node) {
+				&(pid)->tasks[type], pid_links[type]) {
 
 			/*
 			 * Both old and new leaders may be attached to
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a461ff89a3af..445bdf5b1f64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,8 @@ struct task_struct {
 	struct list_head		ptrace_entry;
 
 	/* PID/PID hash table linkage. */
-	struct pid_link			pids[PIDTYPE_MAX];
+	struct pid			*thread_pid;
+	struct hlist_node		pid_links[PIDTYPE_MAX];
 	struct list_head		thread_group;
 	struct list_head		thread_node;
 
@@ -1199,22 +1200,7 @@ struct task_struct {
 
 static inline struct pid *task_pid(struct task_struct *task)
 {
-	return task->pids[PIDTYPE_PID].pid;
-}
-
-/*
- * Without tasklist or RCU lock it is not safe to dereference
- * the result of task_pgrp/task_session even if task == current,
- * we can race with another thread doing sys_setsid/sys_setpgid.
- */
-static inline struct pid *task_pgrp(struct task_struct *task)
-{
-	return task->group_leader->pids[PIDTYPE_PGID].pid;
-}
-
-static inline struct pid *task_session(struct task_struct *task)
-{
-	return task->group_leader->pids[PIDTYPE_SID].pid;
+	return task->thread_pid;
 }
 
 /*
@@ -1263,7 +1249,7 @@ static inline pid_t task_tgid_nr(struct task_struct *tsk)
  */
 static inline int pid_alive(const struct task_struct *p)
 {
-	return p->pids[PIDTYPE_PID].pid != NULL;
+	return p->thread_pid != NULL;
 }
 
 static inline pid_t task_pgrp_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b95a272c1ab5..2dcded16eb1e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -146,7 +146,9 @@ struct signal_struct {
 
 #endif
 
+	/* PID/PID hash table linkage. */
 	struct pid *leader_pid;
+	struct pid *pids[PIDTYPE_MAX];
 
 #ifdef CONFIG_NO_HZ_FULL
 	atomic_t tick_dep_mask;
@@ -559,9 +561,12 @@ void walk_process_tree(struct task_struct *top, proc_visitor, void *);
 static inline
 struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
 {
-	if (type != PIDTYPE_PID)
-		task = task->group_leader;
-	return task->pids[type].pid;
+	struct pid *pid;
+	if (type == PIDTYPE_PID)
+		pid = task_pid(task);
+	else
+		pid = task->signal->pids[type];
+	return pid;
 }
 
 static inline struct pid *task_tgid(struct task_struct *task)
@@ -569,6 +574,21 @@ static inline struct pid *task_tgid(struct task_struct *task)
 	return task->signal->leader_pid;
 }
 
+/*
+ * Without tasklist or RCU lock it is not safe to dereference
+ * the result of task_pgrp/task_session even if task == current,
+ * we can race with another thread doing sys_setsid/sys_setpgid.
+ */
+static inline struct pid *task_pgrp(struct task_struct *task)
+{
+	return task->signal->pids[PIDTYPE_PGID];
+}
+
+static inline struct pid *task_session(struct task_struct *task)
+{
+	return task->signal->pids[PIDTYPE_SID];
+}
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/init/init_task.c b/init/init_task.c
index 7914ffb8dc73..db12a61259f1 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -34,6 +34,11 @@ static struct signal_struct init_signals = {
 #endif
 	INIT_CPU_TIMERS(init_signals)
 	.leader_pid = &init_struct_pid,
+	.pids = {
+		[PIDTYPE_PID]	= &init_struct_pid,
+		[PIDTYPE_PGID]	= &init_struct_pid,
+		[PIDTYPE_SID]	= &init_struct_pid,
+	},
 	INIT_PREV_CPUTIME(init_signals)
 };
 
@@ -112,11 +117,7 @@ struct task_struct init_task
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
-	.pids = {
-		[PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),
-		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),
-		[PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),
-	},
+	.thread_pid	= &init_struct_pid,
 	.thread_group	= LIST_HEAD_INIT(init_task.thread_group),
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..d2952162399b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1549,10 +1549,22 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
 static inline void posix_cpu_timers_init(struct task_struct *tsk) { }
 #endif
 
+static inline void init_task_pid_links(struct task_struct *task)
+{
+	enum pid_type type;
+
+	for (type = PIDTYPE_PID; type < PIDTYPE_MAX; ++type) {
+		INIT_HLIST_NODE(&task->pid_links[type]);
+	}
+}
+
 static inline void
 init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 {
-	 task->pids[type].pid = pid;
+	if (type == PIDTYPE_PID)
+		task->thread_pid = pid;
+	else
+		task->signal->pids[type] = pid;
 }
 
 static inline void rcu_copy_process(struct task_struct *p)
@@ -1928,6 +1940,7 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	init_task_pid_links(p);
 	if (likely(p->pid)) {
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
 
@@ -2036,13 +2049,13 @@ static __latent_entropy struct task_struct *copy_process(
 	return ERR_PTR(retval);
 }
 
-static inline void init_idle_pids(struct pid_link *links)
+static inline void init_idle_pids(struct task_struct *idle)
 {
 	enum pid_type type;
 
 	for (type = PIDTYPE_PID; type < PIDTYPE_MAX; ++type) {
-		INIT_HLIST_NODE(&links[type].node); /* not really needed */
-		links[type].pid = &init_struct_pid;
+		INIT_HLIST_NODE(&idle->pid_links[type]); /* not really needed */
+		init_task_pid(idle, type, &init_struct_pid);
 	}
 }
 
@@ -2052,7 +2065,7 @@ struct task_struct *fork_idle(int cpu)
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
 			    cpu_to_node(cpu));
 	if (!IS_ERR(task)) {
-		init_idle_pids(task->pids);
+		init_idle_pids(task);
 		init_idle(task, cpu);
 	}
 
diff --git a/kernel/pid.c b/kernel/pid.c
index d0de2b59f86f..f8486d2e2346 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -265,27 +265,35 @@ struct pid *find_vpid(int nr)
 }
 EXPORT_SYMBOL_GPL(find_vpid);
 
+static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
+{
+	return (type == PIDTYPE_PID) ?
+		&task->thread_pid :
+		(type == __PIDTYPE_TGID) ?
+		&task->signal->leader_pid :
+		&task->signal->pids[type];
+}
+
 /*
  * attach_pid() must be called with the tasklist_lock write-held.
  */
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
-	struct pid_link *link = &task->pids[type];
-	hlist_add_head_rcu(&link->node, &link->pid->tasks[type]);
+	struct pid *pid = *task_pid_ptr(task, type);
+	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
 static void __change_pid(struct task_struct *task, enum pid_type type,
 			struct pid *new)
 {
-	struct pid_link *link;
+	struct pid **pid_ptr = task_pid_ptr(task, type);
 	struct pid *pid;
 	int tmp;
 
-	link = &task->pids[type];
-	pid = link->pid;
+	pid = *pid_ptr;
 
-	hlist_del_rcu(&link->node);
-	link->pid = new;
+	hlist_del_rcu(&task->pid_links[type]);
+	*pid_ptr = new;
 
 	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
 		if (!hlist_empty(&pid->tasks[tmp]))
@@ -310,8 +318,9 @@ void change_pid(struct task_struct *task, enum pid_type type,
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
 {
-	new->pids[type].pid = old->pids[type].pid;
-	hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
+	if (type == PIDTYPE_PID)
+		new->thread_pid = old->thread_pid;
+	hlist_replace_rcu(&old->pid_links[type], &new->pid_links[type]);
 }
 
 struct task_struct *pid_task(struct pid *pid, enum pid_type type)
@@ -322,7 +331,7 @@ struct task_struct *pid_task(struct pid *pid, enum pid_type type)
 		first = rcu_dereference_check(hlist_first_rcu(&pid->tasks[type]),
 					      lockdep_tasklist_lock_is_held());
 		if (first)
-			result = hlist_entry(first, struct task_struct, pids[(type)].node);
+			result = hlist_entry(first, struct task_struct, pid_links[(type)]);
 	}
 	return result;
 }
@@ -360,9 +369,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid;
 	rcu_read_lock();
-	if (type != PIDTYPE_PID)
-		task = task->group_leader;
-	pid = get_pid(rcu_dereference(task->pids[type].pid));
+	pid = get_pid(rcu_dereference(*task_pid_ptr(task, type)));
 	rcu_read_unlock();
 	return pid;
 }
@@ -420,16 +427,8 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 	rcu_read_lock();
 	if (!ns)
 		ns = task_active_pid_ns(current);
-	if (likely(pid_alive(task))) {
-		struct pid *pid;
-		if (type == PIDTYPE_PID)
-			pid = task_pid(task);
-		else if (type == __PIDTYPE_TGID)
-			pid = task_tgid(task);
-		else
-			pid = rcu_dereference(task->group_leader->pids[type].pid);
-		nr = pid_nr_ns(pid, ns);
-	}
+	if (likely(pid_alive(task)))
+		nr = pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);
 	rcu_read_unlock();
 
 	return nr;
-- 
2.17.1



* [RFC][PATCH 06/11] pid: Implement PIDTYPE_TGID
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (4 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID Eric W. Biederman
                                           ` (5 subsequent siblings)
  11 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Everywhere except in the pid array we distinguish between a task's pid and
a task's tgid (thread group id).  Even in the enumeration we sometimes want
that distinction, so we have added __PIDTYPE_TGID.  With leader_pid we
almost have an implementation of PIDTYPE_TGID in struct signal_struct.

Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array.  Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.

The net size increase is just an extra pointer added to struct pid and
an extra hlist_node (a pair of pointers) added to task_struct.

The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to its fullest.  The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.
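
To make the lookup concrete, here is a minimal userspace sketch of how a
task's pid pointer might be selected once PIDTYPE_TGID is an ordinary slot
in the pids array.  The model_* structs and helpers are hypothetical,
cut-down stand-ins and not the kernel definitions; only the shape of the
selection follows task_pid_ptr() in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Same ordering as the enum in this patch; everything else is a model. */
enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

struct model_pid { int nr; };

/* Shared by every thread in a thread group. */
struct model_signal {
	struct model_pid *pids[PIDTYPE_MAX];
};

struct model_task {
	struct model_pid *thread_pid;	/* per-thread pid */
	struct model_signal *signal;	/* per-process state */
};

/* Hypothetical analogue of task_pid_ptr(): no TGID special case left. */
static inline struct model_pid **model_task_pid_ptr(struct model_task *task,
						     enum pid_type type)
{
	assert(type < PIDTYPE_MAX);
	return (type == PIDTYPE_PID) ?
		&task->thread_pid :
		&task->signal->pids[type];
}

/* Analogue of task_tgid(): just another array lookup. */
static inline struct model_pid *model_task_tgid(struct model_task *task)
{
	return *model_task_pid_ptr(task, PIDTYPE_TGID);
}

/* Analogue of has_group_leader_pid(). */
static inline int model_has_group_leader_pid(struct model_task *task)
{
	return task->thread_pid == model_task_tgid(task);
}
```

In this model two threads that share one model_signal see the same tgid
while keeping distinct thread_pid values, which is the distinction the
commit message describes.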

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/kernel/asm-offsets.c  | 2 +-
 arch/ia64/kernel/fsys.S         | 4 ++--
 arch/s390/kernel/perf_cpum_sf.c | 2 +-
 fs/exec.c                       | 1 +
 include/linux/pid.h             | 3 +--
 include/linux/sched.h           | 4 ++--
 include/linux/sched/signal.h    | 5 ++---
 init/init_task.c                | 2 +-
 kernel/events/core.c            | 2 +-
 kernel/exit.c                   | 1 +
 kernel/fork.c                   | 3 ++-
 kernel/pid.c                    | 2 --
 kernel/time/itimer.c            | 5 +++--
 kernel/time/posix-cpu-timers.c  | 2 +-
 14 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index c1f8a57855af..00e8e2a1eb19 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -67,7 +67,7 @@ void foo(void)
 	DEFINE(IA64_SIGNAL_GROUP_STOP_COUNT_OFFSET,offsetof (struct signal_struct,
 							     group_stop_count));
 	DEFINE(IA64_SIGNAL_SHARED_PENDING_OFFSET,offsetof (struct signal_struct, shared_pending));
-	DEFINE(IA64_SIGNAL_LEADER_PID_OFFSET, offsetof (struct signal_struct, leader_pid));
+	DEFINE(IA64_SIGNAL_PIDS_TGID_OFFSET, offsetof (struct signal_struct, pids[PIDTYPE_TGID]));
 
 	BLANK();
 
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index e85ebdac678b..d80c99a5f55d 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -68,10 +68,10 @@ ENTRY(fsys_getpid)
 	add r9=TI_FLAGS+IA64_TASK_SIZE,r16
 	;;
 	ld4 r9=[r9]
-	add r17=IA64_SIGNAL_LEADER_PID_OFFSET,r17
+	add r17=IA64_SIGNAL_PIDS_TGID_OFFSET,r17
 	;;
 	and r9=TIF_ALLWORK_MASK,r9
-	ld8 r17=[r17]				// r17 = current->signal->leader_pid
+	ld8 r17=[r17]				// r17 = current->signal->pids[PIDTYPE_TGID]
 	;;
 	add r8=IA64_PID_LEVEL_OFFSET,r17
 	;;
diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index 0292d68e7dde..ca0b7ae894bb 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -665,7 +665,7 @@ static void cpumsf_output_event_pid(struct perf_event *event,
 		goto out;
 
 	/* Update the process ID (see also kernel/events/core.c) */
-	data->tid_entry.pid = cpumsf_pid_type(event, pid, __PIDTYPE_TGID);
+	data->tid_entry.pid = cpumsf_pid_type(event, pid, PIDTYPE_TGID);
 	data->tid_entry.tid = cpumsf_pid_type(event, pid, PIDTYPE_PID);
 
 	perf_output_sample(&handle, &header, data, event);
diff --git a/fs/exec.c b/fs/exec.c
index 2d4e0075bd24..79a11fbded7a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1146,6 +1146,7 @@ static int de_thread(struct task_struct *tsk)
 		 */
 		tsk->pid = leader->pid;
 		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3d4c504dcc8c..14a9a39da9c7 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -7,11 +7,10 @@
 enum pid_type
 {
 	PIDTYPE_PID,
+	PIDTYPE_TGID,
 	PIDTYPE_PGID,
 	PIDTYPE_SID,
 	PIDTYPE_MAX,
-	/* only valid to __task_pid_nr_ns() */
-	__PIDTYPE_TGID
 };
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 445bdf5b1f64..06b4e3bda93a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,12 +1275,12 @@ static inline pid_t task_session_vnr(struct task_struct *tsk)
 
 static inline pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
 {
-	return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, ns);
+	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
 }
 
 static inline pid_t task_tgid_vnr(struct task_struct *tsk)
 {
-	return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, NULL);
+	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, NULL);
 }
 
 static inline pid_t task_ppid_nr_ns(const struct task_struct *tsk, struct pid_namespace *ns)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2dcded16eb1e..ee30a5ba475f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -147,7 +147,6 @@ struct signal_struct {
 #endif
 
 	/* PID/PID hash table linkage. */
-	struct pid *leader_pid;
 	struct pid *pids[PIDTYPE_MAX];
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -571,7 +570,7 @@ struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
 
 static inline struct pid *task_tgid(struct task_struct *task)
 {
-	return task->signal->leader_pid;
+	return task->signal->pids[PIDTYPE_TGID];
 }
 
 /*
@@ -607,7 +606,7 @@ static inline bool thread_group_leader(struct task_struct *p)
  */
 static inline bool has_group_leader_pid(struct task_struct *p)
 {
-	return task_pid(p) == p->signal->leader_pid;
+	return task_pid(p) == task_tgid(p);
 }
 
 static inline
diff --git a/init/init_task.c b/init/init_task.c
index db12a61259f1..4f97846256d7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -33,9 +33,9 @@ static struct signal_struct init_signals = {
 	},
 #endif
 	INIT_CPU_TIMERS(init_signals)
-	.leader_pid = &init_struct_pid,
 	.pids = {
 		[PIDTYPE_PID]	= &init_struct_pid,
+		[PIDTYPE_TGID]	= &init_struct_pid,
 		[PIDTYPE_PGID]	= &init_struct_pid,
 		[PIDTYPE_SID]	= &init_struct_pid,
 	},
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80cca2b30c4f..9025b1796ca8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1334,7 +1334,7 @@ static u32 perf_event_pid_type(struct perf_event *event, struct task_struct *p,
 
 static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
 {
-	return perf_event_pid_type(event, p, __PIDTYPE_TGID);
+	return perf_event_pid_type(event, p, PIDTYPE_TGID);
 }
 
 static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
diff --git a/kernel/exit.c b/kernel/exit.c
index 16432428fc6c..25582b442955 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -73,6 +73,7 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 	nr_threads--;
 	detach_pid(p, PIDTYPE_PID);
 	if (group_dead) {
+		detach_pid(p, PIDTYPE_TGID);
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index d2952162399b..cc5be0d01ce6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1946,6 +1946,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 		init_task_pid(p, PIDTYPE_PID, pid);
 		if (thread_group_leader(p)) {
+			init_task_pid(p, PIDTYPE_TGID, pid);
 			init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
 			init_task_pid(p, PIDTYPE_SID, task_session(current));
 
@@ -1954,7 +1955,6 @@ static __latent_entropy struct task_struct *copy_process(
 				p->signal->flags |= SIGNAL_UNKILLABLE;
 			}
 
-			p->signal->leader_pid = pid;
 			p->signal->tty = tty_kref_get(current->signal->tty);
 			/*
 			 * Inherit has_child_subreaper flag under the same
@@ -1965,6 +1965,7 @@ static __latent_entropy struct task_struct *copy_process(
 							 p->real_parent->signal->is_child_subreaper;
 			list_add_tail(&p->sibling, &p->real_parent->children);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
+			attach_pid(p, PIDTYPE_TGID);
 			attach_pid(p, PIDTYPE_PGID);
 			attach_pid(p, PIDTYPE_SID);
 			__this_cpu_inc(process_counts);
diff --git a/kernel/pid.c b/kernel/pid.c
index f8486d2e2346..de1cfc4f75a2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -269,8 +269,6 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
 {
 	return (type == PIDTYPE_PID) ?
 		&task->thread_pid :
-		(type == __PIDTYPE_TGID) ?
-		&task->signal->leader_pid :
 		&task->signal->pids[type];
 }
 
diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
index f26acef5d7b4..9a65713c8309 100644
--- a/kernel/time/itimer.c
+++ b/kernel/time/itimer.c
@@ -139,9 +139,10 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
 {
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
+	struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
 
-	trace_itimer_expire(ITIMER_REAL, sig->leader_pid, 0);
-	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
+	trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
+	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
 
 	return HRTIMER_NORESTART;
 }
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 5a6251ac6f7a..40e6fae46cec 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -895,7 +895,7 @@ static void check_cpu_itimer(struct task_struct *tsk, struct cpu_itimer *it,
 
 		trace_itimer_expire(signo == SIGPROF ?
 				    ITIMER_PROF : ITIMER_VIRTUAL,
-				    tsk->signal->leader_pid, cur_time);
+				    task_tgid(tsk), cur_time);
 		__group_send_sig_info(signo, SEND_SIG_PRIV, tsk);
 	}
 
-- 
2.17.1



* [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (5 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 06/11] pid: Implement PIDTYPE_TGID Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-16 12:51                           ` Oleg Nesterov
  2018-07-11  2:44                         ` [RFC][PATCH 08/11] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
                                           ` (4 subsequent siblings)
  11 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Now that we can make the distinction, use PIDTYPE_TGID rather than
PIDTYPE_PID when delivering group signals.  There is no immediate effect
as they point at the same task, but this allows using enum pid_type
instead of bool group in the signal sending functions.
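
As an illustration of the posix-timers hunk below, here is a minimal
userspace sketch of how the pid type could be derived from the timer's
notify mode.  The MODEL_SIGEV_* constants and model_timer_pid_type() are
hypothetical names used only for this sketch; the logic is the same
"shared unless SIGEV_THREAD_ID" test that posix_timer_event() performs.

```c
#include <assert.h>
#include <stdbool.h>

enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

/* Stand-ins for the SIGEV_* notify values; illustrative only. */
#define MODEL_SIGEV_SIGNAL	0
#define MODEL_SIGEV_THREAD_ID	4

/*
 * A timer aimed at one thread is looked up by PIDTYPE_PID; any other
 * timer targets the whole thread group and is looked up by PIDTYPE_TGID.
 */
static inline enum pid_type model_timer_pid_type(int sigev_notify)
{
	bool shared = !(sigev_notify & MODEL_SIGEV_THREAD_ID);

	return shared ? PIDTYPE_TGID : PIDTYPE_PID;
}

/* Small self-check that doubles as a usage example. */
static inline void model_timer_selfcheck(void)
{
	assert(model_timer_pid_type(MODEL_SIGEV_SIGNAL) == PIDTYPE_TGID);
	assert(model_timer_pid_type(MODEL_SIGEV_SIGNAL | MODEL_SIGEV_THREAD_ID)
	       == PIDTYPE_PID);
}
```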

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c            | 4 ++--
 kernel/time/posix-timers.c | 7 +++----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8d8a940422a8..7caf17d76a84 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1315,7 +1315,7 @@ int kill_pid_info(int sig, struct siginfo *info, struct pid *pid)
 
 	for (;;) {
 		rcu_read_lock();
-		p = pid_task(pid, PIDTYPE_PID);
+		p = pid_task(pid, PIDTYPE_TGID);
 		if (p)
 			error = group_send_sig_info(sig, info, p);
 		rcu_read_unlock();
@@ -1361,7 +1361,7 @@ int kill_pid_info_as_cred(int sig, struct siginfo *info, struct pid *pid,
 		return ret;
 
 	rcu_read_lock();
-	p = pid_task(pid, PIDTYPE_PID);
+	p = pid_task(pid, PIDTYPE_TGID);
 	if (!p) {
 		ret = -ESRCH;
 		goto out_unlock;
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index e08ce3f27447..d640e26d0de0 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -347,12 +347,11 @@ int posix_timer_event(struct k_itimer *timr, int si_private)
 	 */
 	timr->sigq->info.si_sys_private = si_private;
 
+	shared = !(timr->it_sigev_notify & SIGEV_THREAD_ID);
 	rcu_read_lock();
-	task = pid_task(timr->it_pid, PIDTYPE_PID);
-	if (task) {
-		shared = !(timr->it_sigev_notify & SIGEV_THREAD_ID);
+	task = pid_task(timr->it_pid, shared ? PIDTYPE_TGID : PIDTYPE_PID);
+	if (task)
 		ret = send_sigqueue(timr->sigq, task, shared);
-	}
 	rcu_read_unlock();
 	/* If we failed to send the signal the timer stops. */
 	return ret > 0;
-- 
2.17.1



* [RFC][PATCH 08/11] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (6 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  2:44                         ` [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SAK to forcibly kill tasks Eric W. Biederman
                                           ` (3 subsequent siblings)
  11 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

When f_setown is called a pid and a pid type are stored.  Replace the use
of PIDTYPE_PID with PIDTYPE_TGID, as PIDTYPE_TGID goes to the entire thread
group.  Replace the use of PIDTYPE_MAX with PIDTYPE_PID, as PIDTYPE_PID is
now only for a single thread.

Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID, and gather the pid with task_tgid, as the intent of
PIDTYPE_PID there was a thread group level owner.
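
The resulting owner-type mapping can be sketched in userspace as below.
This is only a model: the MODEL_F_OWNER_* constants and both helper names
are hypothetical, the -1 return stands in for an error such as -EINVAL,
and the PGRP-to-PGID pairing is assumed from the surrounding fcntl code
rather than shown in the hunks of this patch.

```c
#include <assert.h>

enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

/* Stand-ins for the F_OWNER_* values passed to F_SETOWN_EX. */
enum model_owner { MODEL_F_OWNER_TID, MODEL_F_OWNER_PID, MODEL_F_OWNER_PGRP };

/* f_setown_ex() direction: owner type -> stored pid type, -1 if unknown. */
static inline int model_owner_to_pid_type(int owner)
{
	switch (owner) {
	case MODEL_F_OWNER_TID:
		return PIDTYPE_PID;	/* was PIDTYPE_MAX before this patch */
	case MODEL_F_OWNER_PID:
		return PIDTYPE_TGID;	/* was PIDTYPE_PID before this patch */
	case MODEL_F_OWNER_PGRP:
		return PIDTYPE_PGID;	/* assumed unchanged */
	default:
		return -1;
	}
}

/* f_getown_ex() direction: stored pid type -> owner type, -1 if none. */
static inline int model_pid_type_to_owner(int type)
{
	switch (type) {
	case PIDTYPE_PID:
		return MODEL_F_OWNER_TID;
	case PIDTYPE_TGID:
		return MODEL_F_OWNER_PID;
	case PIDTYPE_PGID:
		return MODEL_F_OWNER_PGRP;
	default:
		return -1;	/* e.g. PIDTYPE_SID is not a file owner type */
	}
}

/* Round-trip self-check that doubles as a usage example. */
static inline void model_owner_selfcheck(void)
{
	int owner;

	for (owner = MODEL_F_OWNER_TID; owner <= MODEL_F_OWNER_PGRP; owner++)
		assert(model_pid_type_to_owner(model_owner_to_pid_type(owner)) == owner);
}
```

With the overloaded PIDTYPE_MAX value gone, the stored type can be handed
to the pid-task iteration directly, which is why send_sigio() and
send_sigurg() lose their fix-up branches in the diff.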

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/tun.c           |  2 +-
 drivers/tty/tty_io.c        |  4 ++--
 fs/fcntl.c                  | 20 ++++++++------------
 fs/locks.c                  |  2 +-
 fs/notify/dnotify/dnotify.c |  3 ++-
 5 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a192a017cc68..4219735a5418 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3216,7 +3216,7 @@ static int tun_chr_fasync(int fd, struct file *file, int on)
 		goto out;
 
 	if (on) {
-		__f_setown(file, task_pid(current), PIDTYPE_PID, 0);
+		__f_setown(file, task_tgid(current), PIDTYPE_TGID, 0);
 		tfile->flags |= TUN_FASYNC;
 	} else
 		tfile->flags &= ~TUN_FASYNC;
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index aba59521ad48..cec58c53b0c4 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2121,8 +2121,8 @@ static int __tty_fasync(int fd, struct file *filp, int on)
 			pid = tty->pgrp;
 			type = PIDTYPE_PGID;
 		} else {
-			pid = task_pid(current);
-			type = PIDTYPE_PID;
+			pid = task_tgid(current);
+			type = PIDTYPE_TGID;
 		}
 		get_pid(pid);
 		spin_unlock_irqrestore(&tty->ctrl_lock, flags);
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 12273b6ea56d..cf10909cfa0a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -116,7 +116,7 @@ int f_setown(struct file *filp, unsigned long arg, int force)
 	struct pid *pid = NULL;
 	int who = arg, ret = 0;
 
-	type = PIDTYPE_PID;
+	type = PIDTYPE_TGID;
 	if (who < 0) {
 		/* avoid overflow below */
 		if (who == INT_MIN)
@@ -143,7 +143,7 @@ EXPORT_SYMBOL(f_setown);
 
 void f_delown(struct file *filp)
 {
-	f_modown(filp, NULL, PIDTYPE_PID, 1);
+	f_modown(filp, NULL, PIDTYPE_TGID, 1);
 }
 
 pid_t f_getown(struct file *filp)
@@ -171,11 +171,11 @@ static int f_setown_ex(struct file *filp, unsigned long arg)
 
 	switch (owner.type) {
 	case F_OWNER_TID:
-		type = PIDTYPE_MAX;
+		type = PIDTYPE_PID;
 		break;
 
 	case F_OWNER_PID:
-		type = PIDTYPE_PID;
+		type = PIDTYPE_TGID;
 		break;
 
 	case F_OWNER_PGRP:
@@ -206,11 +206,11 @@ static int f_getown_ex(struct file *filp, unsigned long arg)
 	read_lock(&filp->f_owner.lock);
 	owner.pid = pid_vnr(filp->f_owner.pid);
 	switch (filp->f_owner.pid_type) {
-	case PIDTYPE_MAX:
+	case PIDTYPE_PID:
 		owner.type = F_OWNER_TID;
 		break;
 
-	case PIDTYPE_PID:
+	case PIDTYPE_TGID:
 		owner.type = F_OWNER_PID;
 		break;
 
@@ -785,10 +785,8 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_MAX) {
+	if (type == PIDTYPE_PID)
 		group = 0;
-		type = PIDTYPE_PID;
-	}
 
 	pid = fown->pid;
 	if (!pid)
@@ -821,10 +819,8 @@ int send_sigurg(struct fown_struct *fown)
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_MAX) {
+	if (type == PIDTYPE_PID)
 		group = 0;
-		type = PIDTYPE_PID;
-	}
 
 	pid = fown->pid;
 	if (!pid)
diff --git a/fs/locks.c b/fs/locks.c
index db7b6917d9c5..dd5f012887ca 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -546,7 +546,7 @@ lease_setup(struct file_lock *fl, void **priv)
 	if (!fasync_insert_entry(fa->fa_fd, filp, &fl->fl_fasync, fa))
 		*priv = NULL;
 
-	__f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+	__f_setown(filp, task_tgid(current), PIDTYPE_TGID, 0);
 }
 
 static const struct lock_manager_operations lease_manager_ops = {
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e2bea2ac5dfb..47b650b81b07 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -19,6 +19,7 @@
 #include <linux/fs.h>
 #include <linux/module.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/dnotify.h>
 #include <linux/init.h>
 #include <linux/spinlock.h>
@@ -353,7 +354,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 		goto out;
 	}
 
-	__f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+	__f_setown(filp, task_tgid(current), PIDTYPE_TGID, 0);
 
 	error = attach_dn(dn, dn_mark, id, fd, filp, mask);
 	/* !error means that we attached the dn to the dn_mark, so don't free it */
-- 
2.17.1



* [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SAK to forcibly kill tasks
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (7 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 08/11] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-16 14:55                           ` Oleg Nesterov
  2018-07-11  2:44                         ` [RFC][PATCH 10/11] signal: Push pid type from signal senders down into __send_signal Eric W. Biederman
                                           ` (2 subsequent siblings)
  11 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

send_sig is thread-only, making it the wrong call for an action
directed at a process.

force_sig does not set SEND_SIG_FORCED (causing unnecessary work in
__send_signal) and jumps through unnecessary hoops to deal with
blocked and ignored signals, which SIGKILL can never be.

Therefore use do_send_sig_info in all cases in __do_SAK to kill
tasks, as it allows for exactly what the code wants to do.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/tty/tty_io.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index cec58c53b0c4..42ac168c2a47 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2747,7 +2747,7 @@ void __do_SAK(struct tty_struct *tty)
 	do_each_pid_task(session, PIDTYPE_SID, p) {
 		tty_notice(tty, "SAK: killed process %d (%s): by session\n",
 			   task_pid_nr(p), p->comm);
-		send_sig(SIGKILL, p, 1);
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	} while_each_pid_task(session, PIDTYPE_SID, p);
 
 	/* Now kill any processes that happen to have the tty open */
@@ -2755,7 +2755,7 @@ void __do_SAK(struct tty_struct *tty)
 		if (p->signal->tty == tty) {
 			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
 				   task_pid_nr(p), p->comm);
-			send_sig(SIGKILL, p, 1);
+			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 			continue;
 		}
 		task_lock(p);
@@ -2763,7 +2763,7 @@ void __do_SAK(struct tty_struct *tty)
 		if (i != 0) {
 			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
 				   task_pid_nr(p), p->comm, i - 1);
-			force_sig(SIGKILL, p);
+			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
-- 
2.17.1



* [RFC][PATCH 10/11] signal: Push pid type from signal senders down into __send_signal
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (8 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SAK to forcibly kill tasks Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11  3:11                           ` Linus Torvalds
  2018-07-11  2:44                         ` [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork Eric W. Biederman
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
  11 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Use the information we already have to document which signals are sent
to a group of processes and which signals are sent to a single process
or a single thread.  This information will be needed later to ensure
signals are atomic with respect to fork (always coming in before or
after the system call), and so that fork doesn't restart when
unnecessary.
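
A minimal userspace sketch of the idea follows.  The tracepoint hunk in
this patch recovers the old boolean as "type != PIDTYPE_PID"; the choice
of pending queue shown here is an assumption about how that flag is
consumed, and every model_* name is hypothetical rather than kernel API.

```c
#include <assert.h>
#include <stdbool.h>

enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

struct model_pending { int queued; };

/* Per-process state shared by all threads in the group. */
struct model_signal { struct model_pending shared_pending; };

struct model_task {
	struct model_pending pending;	/* per-thread queue */
	struct model_signal *signal;
};

/* The old "bool group" argument, recovered from the pid type. */
static inline bool model_is_group_signal(enum pid_type type)
{
	return type != PIDTYPE_PID;
}

/* Assumed consumer: thread signals queue privately, all others shared. */
static inline struct model_pending *model_pick_pending(struct model_task *t,
						       enum pid_type type)
{
	assert(t && t->signal);
	return model_is_group_signal(type) ?
		&t->signal->shared_pending : &t->pending;
}

static inline void model_send(struct model_task *t, enum pid_type type)
{
	model_pick_pending(t, type)->queued++;
}
```

Callers such as the sysrq path pass PIDTYPE_MAX to mean "every process",
which this model treats like any other non-thread type.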

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/tty/sysrq.c           |  2 +-
 drivers/tty/tty_io.c          |  6 ++--
 fs/fcntl.c                    | 22 +++++---------
 include/linux/sched/signal.h  |  2 +-
 include/linux/signal.h        |  6 ++--
 include/trace/events/signal.h | 12 ++++----
 kernel/exit.c                 |  3 +-
 kernel/signal.c               | 55 +++++++++++++++++++----------------
 kernel/time/posix-timers.c    | 12 +++-----
 mm/oom_kill.c                 |  4 +--
 10 files changed, 60 insertions(+), 64 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 6364890575ec..06ed20dd01ba 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -348,7 +348,7 @@ static void send_sig_all(int sig)
 		if (is_global_init(p))
 			continue;
 
-		do_send_sig_info(sig, SEND_SIG_FORCED, p, true);
+		do_send_sig_info(sig, SEND_SIG_FORCED, p, PIDTYPE_MAX);
 	}
 	read_unlock(&tasklist_lock);
 }
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 42ac168c2a47..c8b4cfaceed1 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2747,7 +2747,7 @@ void __do_SAK(struct tty_struct *tty)
 	do_each_pid_task(session, PIDTYPE_SID, p) {
 		tty_notice(tty, "SAK: killed process %d (%s): by session\n",
 			   task_pid_nr(p), p->comm);
-		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_SID);
 	} while_each_pid_task(session, PIDTYPE_SID, p);
 
 	/* Now kill any processes that happen to have the tty open */
@@ -2755,7 +2755,7 @@ void __do_SAK(struct tty_struct *tty)
 		if (p->signal->tty == tty) {
 			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
 				   task_pid_nr(p), p->comm);
-			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_SID);
 			continue;
 		}
 		task_lock(p);
@@ -2763,7 +2763,7 @@ void __do_SAK(struct tty_struct *tty)
 		if (i != 0) {
 			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
 				   task_pid_nr(p), p->comm, i - 1);
-			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_SID);
 		}
 		task_unlock(p);
 	} while_each_thread(g, p);
diff --git a/fs/fcntl.c b/fs/fcntl.c
index cf10909cfa0a..21927688cbcf 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -723,7 +723,7 @@ static inline int sigio_perm(struct task_struct *p,
 
 static void send_sigio_to_task(struct task_struct *p,
 			       struct fown_struct *fown,
-			       int fd, int reason, int group)
+			       int fd, int reason, enum pid_type type)
 {
 	/*
 	 * F_SETSIG can change ->signum lockless in parallel, make
@@ -767,11 +767,11 @@ static void send_sigio_to_task(struct task_struct *p,
 			else
 				si.si_band = mangle_poll(band_table[reason - POLL_IN]);
 			si.si_fd    = fd;
-			if (!do_send_sig_info(signum, &si, p, group))
+			if (!do_send_sig_info(signum, &si, p, type))
 				break;
 		/* fall-through: fall back on the old plain SIGIO signal */
 		case 0:
-			do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, group);
+			do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, type);
 	}
 }
 
@@ -780,21 +780,17 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 	struct task_struct *p;
 	enum pid_type type;
 	struct pid *pid;
-	int group = 1;
 	
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_PID)
-		group = 0;
-
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
 	
 	read_lock(&tasklist_lock);
 	do_each_pid_task(pid, type, p) {
-		send_sigio_to_task(p, fown, fd, band, group);
+		send_sigio_to_task(p, fown, fd, band, type);
 	} while_each_pid_task(pid, type, p);
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
@@ -802,10 +798,10 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 }
 
 static void send_sigurg_to_task(struct task_struct *p,
-				struct fown_struct *fown, int group)
+				struct fown_struct *fown, enum pid_type type)
 {
 	if (sigio_perm(p, fown, SIGURG))
-		do_send_sig_info(SIGURG, SEND_SIG_PRIV, p, group);
+		do_send_sig_info(SIGURG, SEND_SIG_PRIV, p, type);
 }
 
 int send_sigurg(struct fown_struct *fown)
@@ -813,15 +809,11 @@ int send_sigurg(struct fown_struct *fown)
 	struct task_struct *p;
 	enum pid_type type;
 	struct pid *pid;
-	int group = 1;
 	int ret = 0;
 	
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_PID)
-		group = 0;
-
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
@@ -830,7 +822,7 @@ int send_sigurg(struct fown_struct *fown)
 	
 	read_lock(&tasklist_lock);
 	do_each_pid_task(pid, type, p) {
-		send_sigurg_to_task(p, fown, group);
+		send_sigurg_to_task(p, fown, type);
 	} while_each_pid_task(pid, type, p);
 	read_unlock(&tasklist_lock);
  out_unlock_fown:
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ee30a5ba475f..e99cd53cbd80 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -330,7 +330,7 @@ extern int send_sig(int, struct task_struct *, int);
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
-extern int send_sigqueue(struct sigqueue *,  struct task_struct *, int group);
+extern int send_sigqueue(struct sigqueue *,  struct pid *, enum pid_type);
 extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *);
 
 static inline int restart_syscall(void)
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 3c5200137b24..fe125b0335f7 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -254,11 +254,13 @@ static inline int valid_signal(unsigned long sig)
 
 struct timespec;
 struct pt_regs;
+enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
 extern int do_send_sig_info(int sig, struct siginfo *info,
-				struct task_struct *p, bool group);
-extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
+				struct task_struct *p, enum pid_type type);
+extern int group_send_sig_info(int sig, struct siginfo *info,
+			       struct task_struct *p, enum pid_type type);
 extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
 extern void set_current_blocked(sigset_t *);
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 86582923d51c..e3901aee9b0f 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -39,7 +39,7 @@ enum {
  * @sig: signal number
  * @info: pointer to struct siginfo
  * @task: pointer to struct task_struct
- * @group: shared or private
+ * @type: kind of signal generated
  * @result: TRACE_SIGNAL_*
  *
  * Current process sends a 'sig' signal to 'task' process with
@@ -51,9 +51,9 @@ enum {
 TRACE_EVENT(signal_generate,
 
 	TP_PROTO(int sig, struct siginfo *info, struct task_struct *task,
-			int group, int result),
+			enum pid_type type, int result),
 
-	TP_ARGS(sig, info, task, group, result),
+	TP_ARGS(sig, info, task, type, result),
 
 	TP_STRUCT__entry(
 		__field(	int,	sig			)
@@ -61,7 +61,7 @@ TRACE_EVENT(signal_generate,
 		__field(	int,	code			)
 		__array(	char,	comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	pid			)
-		__field(	int,	group			)
+		__field(	enum pid_type,	type		)
 		__field(	int,	result			)
 	),
 
@@ -70,13 +70,13 @@ TRACE_EVENT(signal_generate,
 		TP_STORE_SIGINFO(__entry, info);
 		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
 		__entry->pid	= task->pid;
-		__entry->group	= group;
+		__entry->type	= type;
 		__entry->result	= result;
 	),
 
 	TP_printk("sig=%d errno=%d code=%d comm=%s pid=%d grp=%d res=%d",
 		  __entry->sig, __entry->errno, __entry->code,
-		  __entry->comm, __entry->pid, __entry->group,
+		  __entry->comm, __entry->pid, __entry->type != PIDTYPE_PID,
 		  __entry->result)
 );
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 25582b442955..0e21e6d21f35 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -681,7 +681,8 @@ static void forget_original_parent(struct task_struct *father,
 				t->parent = t->real_parent;
 			if (t->pdeath_signal)
 				group_send_sig_info(t->pdeath_signal,
-						    SEND_SIG_NOINFO, t);
+						    SEND_SIG_NOINFO, t,
+						    PIDTYPE_TGID);
 		}
 		/*
 		 * If this is a threaded reparent there is no need to
diff --git a/kernel/signal.c b/kernel/signal.c
index 7caf17d76a84..2cf4ddc8e3a3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -895,7 +895,7 @@ static inline int wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !signal_pending(p);
 }
 
-static void complete_signal(int sig, struct task_struct *p, int group)
+static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
 	struct task_struct *t;
@@ -908,7 +908,7 @@ static void complete_signal(int sig, struct task_struct *p, int group)
 	 */
 	if (wants_signal(sig, p))
 		t = p;
-	else if (!group || thread_group_empty(p))
+	else if ((type == PIDTYPE_PID) || thread_group_empty(p))
 		/*
 		 * There is just one thread and it does not need to be woken.
 		 * It will dequeue unblocked signals before it runs again.
@@ -998,7 +998,7 @@ static inline void userns_fixup_signal_uid(struct siginfo *info, struct task_str
 #endif
 
 static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
-			int group, int from_ancestor_ns)
+			 enum pid_type type, int from_ancestor_ns)
 {
 	struct sigpending *pending;
 	struct sigqueue *q;
@@ -1012,7 +1012,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			from_ancestor_ns || (info == SEND_SIG_FORCED)))
 		goto ret;
 
-	pending = group ? &t->signal->shared_pending : &t->pending;
+	pending = type != PIDTYPE_PID ? &t->signal->shared_pending : &t->pending;
 	/*
 	 * Short-circuit ignored signals and support queuing
 	 * exactly one non-rt signal, so that we can get more
@@ -1096,14 +1096,14 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
-	complete_signal(sig, t, group);
+	complete_signal(sig, t, type);
 ret:
-	trace_signal_generate(sig, info, t, group, result);
+	trace_signal_generate(sig, info, t, type, result);
 	return ret;
 }
 
 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
-			int group)
+			enum pid_type type)
 {
 	int from_ancestor_ns = 0;
 
@@ -1112,7 +1112,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			   !task_pid_nr_ns(current, task_active_pid_ns(t));
 #endif
 
-	return __send_signal(sig, info, t, group, from_ancestor_ns);
+	return __send_signal(sig, info, t, type, from_ancestor_ns);
 }
 
 static void print_fatal_signal(int signr)
@@ -1151,23 +1151,23 @@ __setup("print-fatal-signals=", setup_print_fatal_signals);
 int
 __group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 {
-	return send_signal(sig, info, p, 1);
+	return send_signal(sig, info, p, PIDTYPE_TGID);
 }
 
 static int
 specific_send_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 {
-	return send_signal(sig, info, t, 0);
+	return send_signal(sig, info, t, PIDTYPE_PID);
 }
 
 int do_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
-			bool group)
+			enum pid_type type)
 {
 	unsigned long flags;
 	int ret = -ESRCH;
 
 	if (lock_task_sighand(p, &flags)) {
-		ret = send_signal(sig, info, p, group);
+		ret = send_signal(sig, info, p, type);
 		unlock_task_sighand(p, &flags);
 	}
 
@@ -1274,7 +1274,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 /*
  * send signal info to all the members of a group
  */
-int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
+int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
+	enum pid_type type)
 {
 	int ret;
 
@@ -1283,7 +1284,7 @@ int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	rcu_read_unlock();
 
 	if (!ret && sig)
-		ret = do_send_sig_info(sig, info, p, true);
+		ret = do_send_sig_info(sig, info, p, type);
 
 	return ret;
 }
@@ -1301,7 +1302,7 @@ int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp)
 	success = 0;
 	retval = -ESRCH;
 	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
-		int err = group_send_sig_info(sig, info, p);
+		int err = group_send_sig_info(sig, info, p, PIDTYPE_PGID);
 		success |= !err;
 		retval = err;
 	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
@@ -1317,7 +1318,7 @@ int kill_pid_info(int sig, struct siginfo *info, struct pid *pid)
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_TGID);
 		if (p)
-			error = group_send_sig_info(sig, info, p);
+			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
 		rcu_read_unlock();
 		if (likely(!p || error != -ESRCH))
 			return error;
@@ -1376,7 +1377,7 @@ int kill_pid_info_as_cred(int sig, struct siginfo *info, struct pid *pid,
 
 	if (sig) {
 		if (lock_task_sighand(p, &flags)) {
-			ret = __send_signal(sig, info, p, 1, 0);
+			ret = __send_signal(sig, info, p, PIDTYPE_TGID, 0);
 			unlock_task_sighand(p, &flags);
 		} else
 			ret = -ESRCH;
@@ -1420,7 +1421,7 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
 		for_each_process(p) {
 			if (task_pid_vnr(p) > 1 &&
 					!same_thread_group(p, current)) {
-				int err = group_send_sig_info(sig, info, p);
+				int err = group_send_sig_info(sig, info, p, PIDTYPE_MAX);
 				++count;
 				if (err != -EPERM)
 					retval = err;
@@ -1446,7 +1447,7 @@ int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	if (!valid_signal(sig))
 		return -EINVAL;
 
-	return do_send_sig_info(sig, info, p, false);
+	return do_send_sig_info(sig, info, p, PIDTYPE_PID);
 }
 
 #define __si_special(priv) \
@@ -1664,15 +1665,18 @@ void sigqueue_free(struct sigqueue *q)
 		__sigqueue_free(q);
 }
 
-int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
+int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 {
 	int sig = q->info.si_signo;
 	struct sigpending *pending;
+	struct task_struct *t;
 	unsigned long flags;
 	int ret, result;
 
 	BUG_ON(!(q->flags & SIGQUEUE_PREALLOC));
 
+	rcu_read_lock();
+	t = pid_task(pid, type);
 	ret = -1;
 	if (!likely(lock_task_sighand(t, &flags)))
 		goto ret;
@@ -1696,15 +1700,16 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
 	q->info.si_overrun = 0;
 
 	signalfd_notify(t, sig);
-	pending = group ? &t->signal->shared_pending : &t->pending;
+	pending = type != PIDTYPE_PID ? &t->signal->shared_pending : &t->pending;
 	list_add_tail(&q->list, &pending->list);
 	sigaddset(&pending->signal, sig);
-	complete_signal(sig, t, group);
+	complete_signal(sig, t, type);
 	result = TRACE_SIGNAL_DELIVERED;
 out:
-	trace_signal_generate(sig, &q->info, t, group, result);
+	trace_signal_generate(sig, &q->info, t, type, result);
 	unlock_task_sighand(t, &flags);
 ret:
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -3193,7 +3198,7 @@ do_send_specific(pid_t tgid, pid_t pid, int sig, struct siginfo *info)
 		 * probe.  No signal is actually delivered.
 		 */
 		if (!error && sig) {
-			error = do_send_sig_info(sig, info, p, false);
+			error = do_send_sig_info(sig, info, p, PIDTYPE_PID);
 			/*
 			 * If lock_task_sighand() failed we pretend the task
 			 * dies after receiving the signal. The window is tiny,
@@ -3960,7 +3965,7 @@ void kdb_send_sig(struct task_struct *t, int sig)
 			   "the deadlock.\n");
 		return;
 	}
-	ret = send_signal(sig, SEND_SIG_PRIV, t, false);
+	ret = send_signal(sig, SEND_SIG_PRIV, t, PIDTYPE_PID);
 	spin_unlock(&t->sighand->siglock);
 	if (ret)
 		kdb_printf("Fail to deliver Signal %d to process %d.\n",
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index d640e26d0de0..f0c8d98e9eff 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -332,8 +332,8 @@ void posixtimer_rearm(struct siginfo *info)
 
 int posix_timer_event(struct k_itimer *timr, int si_private)
 {
-	struct task_struct *task;
-	int shared, ret = -1;
+	enum pid_type type;
+	int ret = -1;
 	/*
 	 * FIXME: if ->sigq is queued we can race with
 	 * dequeue_signal()->posixtimer_rearm().
@@ -347,12 +347,8 @@ int posix_timer_event(struct k_itimer *timr, int si_private)
 	 */
 	timr->sigq->info.si_sys_private = si_private;
 
-	shared = !(timr->it_sigev_notify & SIGEV_THREAD_ID);
-	rcu_read_lock();
-	task = pid_task(timr->it_pid, shared ? PIDTYPE_TGID : PIDTYPE_PID);
-	if (task)
-		ret = send_sigqueue(timr->sigq, task, shared);
-	rcu_read_unlock();
+	type = !(timr->it_sigev_notify & SIGEV_THREAD_ID) ? PIDTYPE_TGID : PIDTYPE_PID;
+	ret = send_sigqueue(timr->sigq, timr->it_pid, type);
 	/* If we failed to send the signal the timer stops. */
 	return ret > 0;
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..2cc9b238368f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -920,7 +920,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 * in order to prevent the OOM victim from depleting the memory
 	 * reserves from the user space under its control.
 	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -958,7 +958,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		 */
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
 	}
 	rcu_read_unlock();
 
-- 
2.17.1



* [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork.
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (9 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 10/11] signal: Push pid type from signal senders down into __send_signal Eric W. Biederman
@ 2018-07-11  2:44                         ` Eric W. Biederman
  2018-07-11 14:14                           ` Oleg Nesterov
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
  11 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11  2:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is either logically delivered before
the fork, and just to the forking process, or logically delivered after
the fork, to both the forking process and its newly spawned child.  For
signals like periodic timers that are always delivered to a single
process, fork can safely complete and let them appear to be logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple processes, as those signals will
not appear pending during fork.

Add a sequence counter to signal_struct that notes when a signal sent
to multiple processes has been received.  If that sequence counter is
incremented during fork, restart the fork process and let the signals
sent to multiple processes be received before the fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Ensure all multi-process signals received during the fork are
processed before the fork by restarting the fork.  Forwarding signals
sent to multiple processes to the new child appears to be a lot of tricky
work that we will never test so it is not currently implemented.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.
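
The detect-and-restart scheme described above can be sketched in
userspace C roughly as follows.  This is a simplified model and not the
kernel code: struct signal_model and the model_* helpers are
hypothetical names, a plain counter stands in for seqcount_t, and the
enum only mirrors the ordering that the "type > PIDTYPE_TGID" test in
the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering: thread < thread group < pgrp < session < "everything". */
enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

/* Hypothetical stand-in for the one field of signal_struct that matters here. */
struct signal_model {
	unsigned int multi_process_seq;
};

/* Sender side: only signals aimed at more than one process bump the counter. */
static void model_send_signal(struct signal_model *sig, enum pid_type type)
{
	assert(type <= PIDTYPE_MAX);
	if (type > PIDTYPE_TGID)
		sig->multi_process_seq++;
}

/* Fork side, start: remember the counter value. */
static unsigned int model_fork_begin(const struct signal_model *sig)
{
	return sig->multi_process_seq;
}

/* Fork side, end: restart if a multi-process signal arrived or we are dying. */
static bool model_fork_must_restart(const struct signal_model *sig,
				    unsigned int seq, bool fatal_pending)
{
	return sig->multi_process_seq != seq || fatal_pending;
}
```

In this model a timer signal (PIDTYPE_PID or PIDTYPE_TGID) arriving
between model_fork_begin() and model_fork_must_restart() leaves the
counter untouched, so the fork completes; a pgrp or session signal
changes it and forces the restart.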

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn>
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  1 +
 kernel/fork.c                | 19 +++++++++++++++++--
 kernel/signal.c              |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index e99cd53cbd80..f557b23f1d60 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -89,6 +89,7 @@ struct signal_struct {
 
 	/* shared signal handling: */
 	struct sigpending	shared_pending;
+	seqcount_t		multi_process_seq;
 
 	/* thread group exit support */
 	int			group_exit_code;
diff --git a/kernel/fork.c b/kernel/fork.c
index cc5be0d01ce6..f2e1df5c8189 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1456,6 +1456,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	init_waitqueue_head(&sig->wait_chldexit);
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
+	seqcount_init(&sig->multi_process_seq);
 	seqlock_init(&sig->stats_lock);
 	prev_cputime_init(&sig->prev_cputime);
 
@@ -1602,6 +1603,20 @@ static __latent_entropy struct task_struct *copy_process(
 {
 	int retval;
 	struct task_struct *p;
+	unsigned seq;
+
+	/*
+	 * Signals that are delivered to multiple processes need to be
+	 * delivered to just the parent before the fork or both the
+	 * parent and the child after the fork.  Cache the multiple
+	 * process signal sequence number so we can detect any of
+	 * these signals that happen during the fork.  In the unlikely
+	 * event a signal comes in while fork is starting, restart
+	 * fork to handle the signal.
+	 */
+	seq = read_seqcount_begin(&current->signal->multi_process_seq);
+	if (signal_pending(current))
+		return ERR_PTR(-ERESTARTNOINTR);
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1930,8 +1945,8 @@ static __latent_entropy struct task_struct *copy_process(
 	 * A fatal signal pending means that current will exit, so the new
 	 * thread can't slip out of an OOM kill (or normal SIGKILL).
 	*/
-	recalc_sigpending();
-	if (signal_pending(current)) {
+	if (read_seqcount_retry(&current->signal->multi_process_seq, seq) ||
+	    fatal_signal_pending(current)) {
 		retval = -ERESTARTNOINTR;
 		goto bad_fork_cancel_cgroup;
 	}
diff --git a/kernel/signal.c b/kernel/signal.c
index 2cf4ddc8e3a3..515275b3f68f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1096,6 +1096,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
+	if (type > PIDTYPE_TGID)
+		write_seqcount_invalidate(&t->signal->multi_process_seq);
 	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type, result);
-- 
2.17.1



* Re: [RFC][PATCH 10/11] signal: Push pid type from signal senders down into __send_signal
  2018-07-11  2:44                         ` [RFC][PATCH 10/11] signal: Push pid type from signal senders down into __send_signal Eric W. Biederman
@ 2018-07-11  3:11                           ` Linus Torvalds
  0 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2018-07-11  3:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang



On Tue, 10 Jul 2018, Eric W. Biederman wrote:
>
> Use the information we already have to document which signals are sent
> to a group of processes and which signals are sent to a single process
> or a single thread.

Ahh.

This is much nicer than what I was playing with yesterday, trying to 
separate out the "bool group" logic in the signal sending code.

I didn't even think to use the pidtype. 
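
As a rough illustration of why the pidtype is enough, here is a small
userspace sketch of how the old "bool group" and the new
thread/process/multi-process distinction both fall out of the type.
The enum ordering is assumed to match the kernel's, and the model_*
helpers and the two small enums are made-up names for illustration
only.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's ordering after the PIDTYPE_TGID patches. */
enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

/* Stand-ins for t->pending and t->signal->shared_pending. */
enum pending_queue { QUEUE_PRIVATE, QUEUE_SHARED };

/* Who the sender was aiming at. */
enum signal_scope { SCOPE_THREAD, SCOPE_PROCESS, SCOPE_MULTI_PROCESS };

/* The old "group" flag is simply "anything wider than one thread". */
static bool model_old_group_flag(enum pid_type type)
{
	assert(type <= PIDTYPE_MAX);
	return type != PIDTYPE_PID;
}

/* Same test the patch uses to pick the pending queue. */
static enum pending_queue model_pending_queue(enum pid_type type)
{
	return model_old_group_flag(type) ? QUEUE_SHARED : QUEUE_PRIVATE;
}

/* The extra information a bool could not carry. */
static enum signal_scope model_signal_scope(enum pid_type type)
{
	if (type == PIDTYPE_PID)
		return SCOPE_THREAD;
	if (type == PIDTYPE_TGID)
		return SCOPE_PROCESS;
	return SCOPE_MULTI_PROCESS;	/* PGID, SID, and kill(-1)'s PIDTYPE_MAX */
}
```

In the patch, specific_send_sig_info() passes PIDTYPE_PID,
__group_send_sig_info() and kill_pid_info() pass PIDTYPE_TGID,
__kill_pgrp_info() passes PIDTYPE_PGID, and the kill(-1) loop passes
PIDTYPE_MAX.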

In my defense, I would never have done this whole pidtype cleanup that 
preceded this patch just to fix that odd fork() thing.

As I started reading this patch series, I went from "this seems a bit 
pointless" to "Ahhh...." and as I did that I started liking the series a 
lot more.

My initial reaction was "this seems over-engineered" when I just looked at 
the subject lines in my mailbox.

But as I progressed through the series, I really appreciated it. And this 
"10/11" was when I went "ok, I don't even need to see patch 11, I know 
what he's doing."

Anyway, take that as a long-winded ack for the approach and the 
appreciation of the series.

Of course, that's just reading through the patches, no actual _testing_ of 
them. But it looks good to me.

Thanks,

                Linus


* Re: [Bug 200447] infinite loop in fork syscall
  2018-07-10 16:00                     ` [Bug 200447] infinite loop in fork syscall Eric W. Biederman
@ 2018-07-11 12:08                       ` Oleg Nesterov
  0 siblings, 0 replies; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-11 12:08 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

On 07/10, Eric W. Biederman wrote:
>
> > 2. To simplify, lets suppose we add the new PF_INFORK flag. Yes, this is bad,
> >    we can do better. I think we can simply add "struct hlist_head forking_threads"
> >    into signal_struct, so complete_signal() can just do hlist_for_each_entry()
> >    rather than for_each_thread() + PF_INFORK check. We don't even need a new
> >    member in task_struct.
>
> We still need the distinction between multi-process signals and single
> process signals (which is the hard part).  For good performance of
> signal delivery to multi-threaded tasks we still need a new member in
> signal_struct.  Plus it is a bit more work to update the list or even
> walk the list than a sequence counter.
>
> So I think adding a sequence counter to let us know about multiprocess
> signals is the local optimum.

But we can not rely on a sequence counter, there are other reasons why
fork() should fail even if fatal_signal_pending() == F and the counter was
not changed (no multi-process signals).

> > 3. copy_process() can simply block/unblock all signals (except KILL/STOP), see
> >    the "patch" below.
>
> All signals are effectively blocked for the duration of the fork for the
> calling task.    Where we get into trouble and where we need a fix for
> correctness is that another thread can dequeue the signal.   Blocking
> signals of the forking task does not change that.

See my reply to Linus. Please look at the change in complete_signal().

> I think that reveals another bug in our current logic.  For blocked
> multi-process signals we don't ensure they get delivered to both the
> parent and the child if the signal logically comes in after the fork.

I thought this too. I simply do not know if this is right or not.

For now I assume that this is correct and by design, iow if fork() is called
with (say) SIGTERM blocked, then we do not care if kill_pgrp(SIGTERM) misses
the new child.

If we want to change this, I think this needs another discussion.

Oleg.



* Re: [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork.
  2018-07-11  2:44                         ` [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork Eric W. Biederman
@ 2018-07-11 14:14                           ` Oleg Nesterov
  2018-07-11 16:02                             ` Eric W. Biederman
  2018-07-13 14:51                             ` Eric W. Biederman
  0 siblings, 2 replies; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-11 14:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/10, Eric W. Biederman wrote:
>
> @@ -1602,6 +1603,20 @@ static __latent_entropy struct task_struct *copy_process(
>  {
>  	int retval;
>  	struct task_struct *p;
> +	unsigned seq;
> +
> +	/*
> +	 * Signals that are delivered to multiple processes need to be
> +	 * delivered to just the parent before the fork or both the
> +	 * parent and the child after the fork.  Cache the multiple
> +	 * process signal sequence number so we can detect any of
> +	 * these signals that happen during the fork.  In the unlikely
> +	 * event a signal comes in while fork is starting, restart
> +	 * fork to handle the signal.
> +	 */
> +	seq = read_seqcount_begin(&current->signal->multi_process_seq);
> +	if (signal_pending(current))
> +		return ERR_PTR(-ERESTARTNOINTR);
>
>  	/*
>  	 * Don't allow sharing the root directory with processes in a different
> @@ -1930,8 +1945,8 @@ static __latent_entropy struct task_struct *copy_process(
>  	 * A fatal signal pending means that current will exit, so the new
>  	 * thread can't slip out of an OOM kill (or normal SIGKILL).
>  	*/
> -	recalc_sigpending();
> -	if (signal_pending(current)) {
> +	if (read_seqcount_retry(&current->signal->multi_process_seq, seq) ||
> +	    fatal_signal_pending(current)) {
>  		retval = -ERESTARTNOINTR;
>  		goto bad_fork_cancel_cgroup;

So once again, I think this is not right, see the discussion on bugzilla.

If signal_pending() == T we simply can't know if copy_process() can succeed or not.
I have already mentioned the races with stop/freeze, but I think there are more.

And in fact I think that the fact that signal_wake_up() helps to avoid the races
with fork() is useful. Say, we could add signal_wake_up() into syscall_regfunc()
and kill syscall_tracepoint_update(). Not that I think this particular change makes
any sense, but it can work.



That is why I tried to suggest another approach. copy_process() should always fail
if signal_pending() == T, just the "real" signal should not disturb the forking
thread unless the signal is fatal or multi-process.

This also makes another difference in multi-threaded case, a signal with a handler
sent to a forking process will be re-targeted to another thread which can handle it;
with your patch this signal will be "blocked" until fork() finishes or until another
thread gets TIF_SIGPENDING. Not that I think this is that important, but still.

Oleg.



* Re: [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork.
  2018-07-11 14:14                           ` Oleg Nesterov
@ 2018-07-11 16:02                             ` Eric W. Biederman
  2018-07-12 13:42                               ` Oleg Nesterov
  2018-07-13 14:51                             ` Eric W. Biederman
  1 sibling, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-11 16:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/10, Eric W. Biederman wrote:
>>
>> @@ -1602,6 +1603,20 @@ static __latent_entropy struct task_struct *copy_process(
>>  {
>>  	int retval;
>>  	struct task_struct *p;
>> +	unsigned seq;
>> +
>> +	/*
>> +	 * Signals that are delivered to multiple processes need to be
>> +	 * delivered to just the parent before the fork or both the
>> +	 * parent and the child after the fork.  Cache the multiple
>> +	 * process signal sequence number so we can detect any of
>> +	 * these signals that happen during the fork.  In the unlikely
>> +	 * event a signal comes in while fork is starting, restart
>> +	 * fork to handle the signal.
>> +	 */
>> +	seq = read_seqcount_begin(&current->signal->multi_process_seq);
>> +	if (signal_pending(current))
>> +		return ERR_PTR(-ERESTARTNOINTR);
>>
>>  	/*
>>  	 * Don't allow sharing the root directory with processes in a different
>> @@ -1930,8 +1945,8 @@ static __latent_entropy struct task_struct *copy_process(
>>  	 * A fatal signal pending means that current will exit, so the new
>>  	 * thread can't slip out of an OOM kill (or normal SIGKILL).
>>  	*/
>> -	recalc_sigpending();
>> -	if (signal_pending(current)) {
>> +	if (read_seqcount_retry(&current->signal->multi_process_seq, seq) ||
>> +	    fatal_signal_pending(current)) {
>>  		retval = -ERESTARTNOINTR;
>>  		goto bad_fork_cancel_cgroup;
>
> So once again, I think this is not right, see the discussion on
> bugzilla.

I am trying to dig through and understand your concerns, but I am
having difficulty following them.

Do the previous patches look good to you?  Can we say that is a
sufficient method to get the information about signals that are sent to
multiple processes into __send_signal?

> If signal_pending() == T we simply can't know if copy_process() can succeed or not.
> I have already mentioned the races with stop/freeze, but I think there
> are more.

If I understand you correctly, your concern is that since we added the:

	recalc_sigpending();
	if (signal_pending(current))
		return -ERESTARTNOINTR;

other (non-signal) code such as the freezer has come to depend upon that
test.  Changing the test in the proposed way will allow the new child to
escape the freezer, as it is not guaranteed the new child will be
frozen.

It seems reasonable to look at other things that set TIF_SIGPENDING and
see if any of them are broken by the fork changes.

A quick look at exit_to_usermode_loop shows that TIF_SIGPENDING only
triggers signal handling.  In get_signal there is only task_work_run,
try_to_freeze, and buried there is ptrace_stop.

Plus there is restart_syscall() that sets TIF_SIGPENDING.  Now that we
aren't guaranteed that TIF_SIGPENDING is set before we restart, the code
should be using "retval = restart_syscall();"  I will fix that.
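
To spell out why that matters, here is a toy userspace model of the
idea.  It is only a sketch under my own assumptions: struct task_model
and the model_* helpers are invented names, and the exit-to-user path
is reduced to a single predicate.

```c
#include <assert.h>
#include <stdbool.h>

#define ERESTARTNOINTR 513	/* kernel-internal code, not meant to reach userspace */

/* Hypothetical stand-in for the one thread flag of interest. */
struct task_model {
	bool tif_sigpending;
};

/* Models restart_syscall(): flag the task, then hand back the restart code. */
static int model_restart_syscall(struct task_model *t)
{
	assert(t);
	t->tif_sigpending = true;
	return -ERESTARTNOINTR;
}

/*
 * Models the return to userspace: the restart code is only acted upon
 * when TIF_SIGPENDING routes the task through the signal handling path.
 */
static bool model_syscall_is_restarted(const struct task_model *t, int retval)
{
	return t->tif_sigpending && retval == -ERESTARTNOINTR;
}
```

In this model a bare "retval = -ERESTARTNOINTR" with the flag clear is
not restarted, while "retval = model_restart_syscall(t)" always is.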


I will dig in and see what attention those cases need in fork
signal_pending behavior.  I am hoping that it will be as simple as
adding:

	/*
	 * Have the child return to userspace slowly if
	 * TIF_SIGPENDING was set during fork.
	 */
	if (test_tsk_thread_flag(current, TIF_SIGPENDING))
		set_tsk_thread_flag(p, TIF_SIGPENDING);


> And in fact I think that the fact that signal_wake_up() helps to avoid the races
> with fork() is useful. Say, we could add signal_wake_up() into syscall_regfunc()
> and kill syscall_tracepoint_update(). Not that I think this particular change makes
> any sense, but it can work.
>
> That is why I tried to suggest another approach. copy_process() should always fail
> if signal_pending() == T, just the "real" signal should not disturb the forking
> thread unless the signal is fatal or multi-process.

So after seeing the report of periodic timers causing a 40ms fork to
stretch into a 1000ms fork because of restarts, I am not a fan of cases
where fork has to restart.  40ms is a lot of work to abandon.

A practical (and fixable) problem with your patch was that you modified
task->blocked which was then copied to the child.  So all children now
would start with all signals being blocked.

> This also makes another difference in multi-threaded case, a signal with a handler
> sent to a forking process will be re-targeted to another thread which can handle it;
> with your patch this signal will be "blocked" until fork() finishes or until another
> thread gets TIF_SIGPENDING. Not that I think this is that important,
> but still.

I would not object to wants_signal deciding that a task in the middle of
copy_process does not want signals.

Eric


* Re: [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork.
  2018-07-11 16:02                             ` Eric W. Biederman
@ 2018-07-12 13:42                               ` Oleg Nesterov
  2018-07-12 17:11                                 ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-12 13:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/11, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> -	recalc_sigpending();
> >> -	if (signal_pending(current)) {
> >> +	if (read_seqcount_retry(&current->signal->multi_process_seq, seq) ||
> >> +	    fatal_signal_pending(current)) {
> >>  		retval = -ERESTARTNOINTR;
> >>  		goto bad_fork_cancel_cgroup;
> >
> > So once again, I think this is not right, see the discussion on
> > bugzilla.
>
> I am trying to dig through and understand your concerns.  I am having
> difficulty understanding your concerns.
>
> Do the previous patches look good to you?

Yes, yes, personally I like 1-10 after a quick glance. I'll try to read this
series carefully later, but I don't think I will find something really wrong.

> If I understand you correctly.  Your concern is that since we added the:
>
> 	recalc_sigpending();
>         if (signal_pending(current))
>         	return -ERESTARTNOINTR;
>
> Other (non-signal) code such as the freezer has come to depend upon that
> test.  Changing the test in the proposed way will allow the new child to
> escape the freezer, as it is not guaranteed the new child will be
> frozen.

Yes.
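
The difference between the two tests can be pictured with a small
user-space model.  This is only a sketch, not kernel code: the struct,
the event names and the helpers below are hypothetical stand-ins, and it
assumes that a freezer-style wakeup sets the equivalent of
TIF_SIGPENDING without touching the sequence counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state copy_process() looks at. */
struct toy_forker {
	bool sigpending;            /* models TIF_SIGPENDING */
	bool fatal_pending;         /* models fatal_signal_pending() */
	unsigned multi_process_seq; /* models signal->multi_process_seq */
};

enum toy_event { TOY_EVENT_FREEZER, TOY_EVENT_MULTI_PROCESS_SIGNAL };

/* Apply one event that arrives while the modeled fork is in progress. */
static void toy_apply_event(struct toy_forker *t, enum toy_event ev)
{
	assert(ev == TOY_EVENT_FREEZER || ev == TOY_EVENT_MULTI_PROCESS_SIGNAL);

	t->sigpending = true;           /* both events wake the task */
	if (ev == TOY_EVENT_MULTI_PROCESS_SIGNAL)
		t->multi_process_seq++; /* only a multi-process signal bumps it */
}

/* Returns true if the modeled fork would fail with -ERESTARTNOINTR. */
static bool toy_fork_restarts(enum toy_event ev, bool use_new_check)
{
	struct toy_forker t = { false, false, 0 };
	unsigned seq_at_start = t.multi_process_seq;

	toy_apply_event(&t, ev);

	if (use_new_check)
		return t.multi_process_seq != seq_at_start || t.fatal_pending;
	return t.sigpending;
}
```

In this model the old test restarts the fork for both events, while
toy_fork_restarts(TOY_EVENT_FREEZER, true) is false, which is the case
where the new child could escape the freezer.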

>
> It seems reasonable to look at other things that set TIF_SIGPENDING and
> see if any of them are broken by the fork changes.

Again, please look at do_signal_stop(). If it was the source of signal_pending(),
copy_process() should fail. Or we should update the new thread to participate in
group-stop, but then we need to set TIF_SIGPENDING, copy the relevant part of
current->jobctl, and increment ->group_stop_count at least.

> A practical (and fixable) problem with your patch was that you modified
> task->blocked which was then copied to the child.  So all children now
> would start with all signals being blocked.

What are you talking about, this pseudo-code has a lot more bugs ;)

OK, at least I certainly agree that this approach needs more changes in copy_process().

> > This also makes another difference in multi-threaded case, a signal with a handler
> > sent to a forking process will be re-targeted to another thread which can handle it;
> > with your patch this signal will be "blocked" until fork() finishes or until another
> > thread gets TIF_SIGPENDING. Not that I think this is that important,
> > but still.
>
> I would not object to wants_signal deciding that a task in the middle of
> copy_process does not want signals.

This is not enough, we need to signal all in-fork threads...

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork.
  2018-07-12 13:42                               ` Oleg Nesterov
@ 2018-07-12 17:11                                 ` Eric W. Biederman
  0 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-12 17:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/11, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> >> -	recalc_sigpending();
>> >> -	if (signal_pending(current)) {
>> >> +	if (read_seqcount_retry(&current->signal->multi_process_seq, seq) ||
>> >> +	    fatal_signal_pending(current)) {
>> >>  		retval = -ERESTARTNOINTR;
>> >>  		goto bad_fork_cancel_cgroup;
>> >
>> > So once again, I think this is not right, see the discussion on
>> > bugzilla.
>>
>> I am trying to dig through and understand your concerns.  I am having
>> difficulty understanding your concerns.
>>
>> Do the previous patches look good to you?
>
> Yes, yes, personally I like 1-10 after a quick glance. I'll try to read this
> series carefully later, but I don't think I will find something really
> wrong.

Good.  Then I will consider those acked by both you and Linus.

Oleg do you mind if I add:
Acked-by: Oleg Nesterov <oleg@redhat.com>

To those patches?

>> If I understand you correctly, your concern is that since we added the:
>>
>> 	recalc_sigpending();
>>         if (signal_pending(current))
>>         	return -ERESTARTNOINTR;
>>
>> Other (non-signal) code such as the freezer has come to depend upon that
>> test.  Changing the test in the proposed way will allow the new child to
>> escape the freezer, as it is not guaranteed the new child will be
>> frozen.
>
> Yes.


>> It seems reasonable to look at other things that set TIF_SIGPENDING and
>> see if any of them are broken by the fork changes.
>
> Again, please look at do_signal_stop(). If it was the source of signal_pending(),
> copy_process() should fail. Or we should update the new thread to participate in
> group-stop, but then we need to set TIF_SIGPENDING, copy the relevant part of
> current->jobctl, and increment ->group_stop_count at least.

Hmm.  That is an interesting twist.

In general do_signal_stop is fine as long as we have the
recalc_sigpending at the start of fork.

But yes.  What happens when it isn't a fork but it is a clone?  Signals
that affect the entire thread group (STOP CLONE) are very interesting
from this perspective.

Same issue as with fork, but different scope.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork.
  2018-07-11 14:14                           ` Oleg Nesterov
  2018-07-11 16:02                             ` Eric W. Biederman
@ 2018-07-13 14:51                             ` Eric W. Biederman
  1 sibling, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-13 14:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> That is why I tried to suggest another approach. copy_process() should always fail
> if signal_pending() == T, just the "real" signal should not disturb the forking
> thread unless the signal is fatal or multi-process.

I understand now why you are suggesting another approach.  There are a lot
of cases that could be affected by the removal of
"if (signal_pending()) return restart_syscall();" in copy_process.

I just shiver at the thought of leaving the code that way.  That is just
leaving a mess for later and the signal handling code already has way
too many of those.

So I am going to try and work through all of the cases.

I might even implement queueing shared signals for after the fork.  As
it is looking increasingly less difficult.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-11  2:44                         ` [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID Eric W. Biederman
@ 2018-07-16 12:51                           ` Oleg Nesterov
  2018-07-16 14:50                             ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-16 12:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/10, Eric W. Biederman wrote:
>
> Now that we can make the distinction use PIDTYPE_TGID rather than
> PIDTYPE_PID.

Wait, wait, this doesn't look right...

> There is no immediate effect as they point at the
> same task,

How so? pid_task(pid, PIDTYPE_TGID) will return NULL unless this pid is actually
a group leader's pid.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1315,7 +1315,7 @@ int kill_pid_info(int sig, struct siginfo *info, struct pid *pid)
>
>  	for (;;) {
>  		rcu_read_lock();
> -		p = pid_task(pid, PIDTYPE_PID);
> +		p = pid_task(pid, PIDTYPE_TGID);
>  		if (p)
>  			error = group_send_sig_info(sig, info, p);

So, currently kill(pid_nr) always works, even if pid_nr is a sub-thread's tid.

After this change kill(2) will always fail with -ESRCH in this case.

Or I am totally confused?
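
The lookup difference can be sketched with a toy user-space model.
Everything here is hypothetical (the toy_* names, the two-entry task
table); it only assumes what is stated above, namely that a TGID lookup
succeeds for a group leader's pid and returns NULL for a sub-thread's
tid.

```c
#include <assert.h>
#include <stddef.h>

enum toy_pid_type { TOY_PIDTYPE_PID, TOY_PIDTYPE_TGID };

struct toy_task {
	int pid;  /* thread id */
	int tgid; /* id of the thread group leader */
};

/* One process: group leader 100 with a single sub-thread 101. */
static const struct toy_task toy_tasks[] = {
	{ 100, 100 },
	{ 101, 100 },
};
#define TOY_NR_TASKS (sizeof(toy_tasks) / sizeof(toy_tasks[0]))

/* Toy counterpart of pid_task(): find the task attached to nr under type. */
static const struct toy_task *toy_pid_task(int nr, enum toy_pid_type type)
{
	size_t i;

	assert(type == TOY_PIDTYPE_PID || type == TOY_PIDTYPE_TGID);

	for (i = 0; i < TOY_NR_TASKS; i++) {
		if (toy_tasks[i].pid != nr)
			continue;
		/* Only the group leader is attached under the TGID type. */
		if (type == TOY_PIDTYPE_TGID &&
		    toy_tasks[i].pid != toy_tasks[i].tgid)
			return NULL;
		return &toy_tasks[i];
	}
	return NULL;
}

/* Models kill(2): 0 on success, -1 where the kernel would return -ESRCH. */
static int toy_kill(int nr, enum toy_pid_type lookup_type)
{
	return toy_pid_task(nr, lookup_type) ? 0 : -1;
}
```

In this model toy_kill(101, TOY_PIDTYPE_PID) succeeds while
toy_kill(101, TOY_PIDTYPE_TGID) fails, which is the behavior change
being questioned.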

> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -347,12 +347,11 @@ int posix_timer_event(struct k_itimer *timr, int si_private)
>  	 */
>  	timr->sigq->info.si_sys_private = si_private;
>  
> +	shared = !(timr->it_sigev_notify & SIGEV_THREAD_ID);
>  	rcu_read_lock();
> -	task = pid_task(timr->it_pid, PIDTYPE_PID);
> -	if (task) {
> -		shared = !(timr->it_sigev_notify & SIGEV_THREAD_ID);
> +	task = pid_task(timr->it_pid, shared ? PIDTYPE_TGID : PIDTYPE_PID);

This looks fine, afaics without SIGEV_THREAD_ID ->it_pid is always task_tgid().

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-16 12:51                           ` Oleg Nesterov
@ 2018-07-16 14:50                             ` Eric W. Biederman
  2018-07-16 17:17                               ` Linus Torvalds
  2018-07-17 16:38                               ` Linus Torvalds
  0 siblings, 2 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-16 14:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/10, Eric W. Biederman wrote:
>>
>> Now that we can make the distinction use PIDTYPE_TGID rather than
>> PIDTYPE_PID.
>
> Wait, wait, this doesn't look right...
>
>> There is no immediate effect as they point at the
>> same task,
>
> How so? pid_task(pid, PIDTYPE_TGID) will return NULL unless this pid is actually
> a group leader's pid.
>
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -1315,7 +1315,7 @@ int kill_pid_info(int sig, struct siginfo *info, struct pid *pid)
>>
>>  	for (;;) {
>>  		rcu_read_lock();
>> -		p = pid_task(pid, PIDTYPE_PID);
>> +		p = pid_task(pid, PIDTYPE_TGID);
>>  		if (p)
>>  			error = group_send_sig_info(sig, info, p);
>
> So, currently kill(pid_nr) always works, even if pid_nr is a sub-thread's tid.
>
> After this change kill(2) will always fail with -ESRCH in this case.
>
> Or I am totally confused?

No you are not.

That does at least need to be documented in the description of the
patch.

In practice since glibc does not make thread id's available I don't
expect anyone relies on this behavior.  Since no one relies on it we
can change it without creating a regression.

I believe this can be described as fixing a bug that we were not able to
before the introduction of PIDTYPE_TGID.

I will update my change description.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK  to forcibly kill tasks
  2018-07-11  2:44                         ` [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK to forcibly kill tasks Eric W. Biederman
@ 2018-07-16 14:55                           ` Oleg Nesterov
  2018-07-16 15:08                             ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-16 14:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/10, Eric W. Biederman wrote:
>
> Therefore use do_send_sig_info in all cases in __do_SAK to kill
> tasks as it allows for exactly what the code wants to do.

OK, but probably the changelog should also mention that now even the global
init will be killed if it has this tty opened.


> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  drivers/tty/tty_io.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index cec58c53b0c4..42ac168c2a47 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -2747,7 +2747,7 @@ void __do_SAK(struct tty_struct *tty)
>  	do_each_pid_task(session, PIDTYPE_SID, p) {
>  		tty_notice(tty, "SAK: killed process %d (%s): by session\n",
>  			   task_pid_nr(p), p->comm);
> -		send_sig(SIGKILL, p, 1);
> +		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	} while_each_pid_task(session, PIDTYPE_SID, p);
>  
>  	/* Now kill any processes that happen to have the tty open */
> @@ -2755,7 +2755,7 @@ void __do_SAK(struct tty_struct *tty)
>  		if (p->signal->tty == tty) {
>  			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
>  				   task_pid_nr(p), p->comm);
> -			send_sig(SIGKILL, p, 1);
> +			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  			continue;
>  		}
>  		task_lock(p);
> @@ -2763,7 +2763,7 @@ void __do_SAK(struct tty_struct *tty)
>  		if (i != 0) {
>  			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
>  				   task_pid_nr(p), p->comm, i - 1);
> -			force_sig(SIGKILL, p);
> +			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  		}
>  		task_unlock(p);
>  	} while_each_thread(g, p);
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK  to forcibly kill tasks
  2018-07-16 14:55                           ` Oleg Nesterov
@ 2018-07-16 15:08                             ` Eric W. Biederman
  2018-07-16 16:50                               ` Linus Torvalds
  2018-07-17 10:58                               ` Oleg Nesterov
  0 siblings, 2 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-16 15:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/10, Eric W. Biederman wrote:
>>
>> Therefore use do_send_sig_info in all cases in __do_SAK to kill
>> tasks as it allows for exactly what the code wants to do.
>
> OK, but probably the changelog should also mention that now even the global
> init will be killed if it has this tty opened.

force_sig was ensuring the global init would die.  So that isn't a
change.  Mentioning it isn't a bad idea.

The change for global init is it will now die if init is a member of the
session or init is using this tty as its controlling tty.

Semantically killing init with SAK is completely appropriate, as
otherwise the guarantee that nothing has the terminal open will not be
present.  So yes I will update the description.

Eric

>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  drivers/tty/tty_io.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
>> index cec58c53b0c4..42ac168c2a47 100644
>> --- a/drivers/tty/tty_io.c
>> +++ b/drivers/tty/tty_io.c
>> @@ -2747,7 +2747,7 @@ void __do_SAK(struct tty_struct *tty)
>>  	do_each_pid_task(session, PIDTYPE_SID, p) {
>>  		tty_notice(tty, "SAK: killed process %d (%s): by session\n",
>>  			   task_pid_nr(p), p->comm);
>> -		send_sig(SIGKILL, p, 1);
>> +		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>>  	} while_each_pid_task(session, PIDTYPE_SID, p);
>>  
>>  	/* Now kill any processes that happen to have the tty open */
>> @@ -2755,7 +2755,7 @@ void __do_SAK(struct tty_struct *tty)
>>  		if (p->signal->tty == tty) {
>>  			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
>>  				   task_pid_nr(p), p->comm);
>> -			send_sig(SIGKILL, p, 1);
>> +			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>>  			continue;
>>  		}
>>  		task_lock(p);
>> @@ -2763,7 +2763,7 @@ void __do_SAK(struct tty_struct *tty)
>>  		if (i != 0) {
>>  			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
>>  				   task_pid_nr(p), p->comm, i - 1);
>> -			force_sig(SIGKILL, p);
>> +			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>>  		}
>>  		task_unlock(p);
>>  	} while_each_thread(g, p);
>> -- 
>> 2.17.1
>> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK to forcibly kill tasks
  2018-07-16 15:08                             ` Eric W. Biederman
@ 2018-07-16 16:50                               ` Linus Torvalds
  2018-07-16 19:17                                 ` Eric W. Biederman
  2018-07-17 10:58                               ` Oleg Nesterov
  1 sibling, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-16 16:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Mon, Jul 16, 2018 at 8:08 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The change for global init is it will now die if init is a member of the
> session or init is using this tty as its controlling tty.
>
> Semantically killing init with SAK is completely appropriate.

No.

Semantically killing init is completely wrong. Because it will kill
the whole system.

And I don't mean that in "now init won't spawn new things". I mean
that in "now we don't have a child reaper any more, and the system
will be dead because we'll panic on exit".

So it's not about the controlling tty, it's about fundamental kernel
internal consistency guarantees.

See

        write_unlock_irq(&tasklist_lock);
        if (unlikely(pid_ns == &init_pid_ns)) {
                panic("Attempted to kill init! exitcode=0x%08x\n",
                        father->signal->group_exit_code ?: father->exit_code);
        }

in kernel/exit.c.

               Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-16 14:50                             ` Eric W. Biederman
@ 2018-07-16 17:17                               ` Linus Torvalds
  2018-07-16 18:01                                 ` Eric W. Biederman
  2018-07-17 16:38                               ` Linus Torvalds
  1 sibling, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-16 17:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Mon, Jul 16, 2018 at 7:50 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> In practice since glibc does not make thread id's available I don't
> expect anyone relies on this behavior.  Since no one relies on it we
> can change it without creating a regression.

Maybe.

However, possibly not.

The thing is, glibc wasn't the original or only use of our threads. In
fact, there are people out there that use clone() directly, without
using it for posix threading. And Oleg was right to notice this,
because the traditional model was literally to just use "kill()" on
the pid returned from clone().

So the semantics of Linux kill() really is to kill the thread, not the
group leader. glibc's implementation of pthreads is not the only model
out there.

Now, it is possible that none of the legacy uses use CLONE_THREAD
and thus aren't affected (because tgid will always be pid). So maybe
nobody notices.

But we really have three different 'kill' system calls:

 - the original 'kill' system call (#37 on x86-32).

   This looks up the thread ID, but signals the *group*.

 - tkill (#238)

   This looks up the thread, and signals the specific thread.

 - tgkill (#270)

   This looks up the tgid, and signals the group.

Modern glibc will not even use the original 'kill()' at all, I think.
But it's the legacy behavior.

So I do think Oleg is right. We should be careful. You'll not notice
breakage on a modern distro, but you might easily break old code.

                 Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-16 17:17                               ` Linus Torvalds
@ 2018-07-16 18:01                                 ` Eric W. Biederman
  2018-07-16 18:40                                   ` Linus Torvalds
  2018-07-17  9:56                                   ` Oleg Nesterov
  0 siblings, 2 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-16 18:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jul 16, 2018 at 7:50 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> In practice since glibc does not make thread id's available I don't
>> expect anyone relies on this behavior.  Since no one relies on it we
>> can change it without creating a regression.
>
> Maybe.
>
> However, possibly not.
>
> The thing is, glibc wasn't the original or only use of our threads. In
> fact, there are people out there that use clone() directly, without
> using it for posix threading. And Oleg was right to notice this,
> because the traditional model was literally to just use "kill()" on
> the pid returned from clone().

I completely agree that Oleg was right to notice this, and I was
definitely not right to overlook it, in my description and otherwise.

I also think the semantic change needs to happen in its own separate
patch so things can be tracked down.

I really don't think anyone uses this but it is not smart to hold the
rest of the changes hostage to my belief.  So I am thinking about how
to rework this.

> So the semantics of Linux kill() really is to kill the thread, not the
> group leader. glibc's implementation of pthreads is not the only model
> out there.

There are two questions.
a) Can we use the pid of a thread to find the thread group?
b) Will the signal be queued in the thread group?

> Now, it is possible that none of the legacy uses use CLONE_THREAD
> and thus aren't affected (because tgid will always be pid). So maybe
> nobody notices.

That is what I expect.  I don't think legacy is a good description.
Calling other uses of CLONE_THREAD non-glibc seems better.  The old
LinuxThreads did not use CLONE_THREAD because it did not exist.
>
> But we really have three different 'kill' system calls:
>
>  - the original 'kill' system call (#37 on x86-32).
>
>    This looks up the thread ID, but signals the *group*.
>
>  - tkill (#238)
>
>    This looks up the thread, and signals the specific thread.
>
>  - tgkill (#270)
>
>    This looks up the tgid, and signals the group.

No.  tgkill is a less racy version of tkill and verifies that the
thread it signals is in the proper thread group.
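
The three calls can be probed from user space with signal 0, which runs
the lookup and permission checks without delivering anything.  This is a
minimal sketch, assuming a Linux system where syscall() and the SYS_tkill
and SYS_tgkill numbers are available; the struct and function names are
made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result record for the four probes below. */
struct kill_probe {
	int kill_rc;
	int tkill_rc;
	int tgkill_rc;
	int tgkill_wrong_tgid_rc;
	int tgkill_wrong_tgid_errno;
};

/* Must be called from the main thread, where tid == tgid. */
static struct kill_probe probe_kill_variants(void)
{
	struct kill_probe r;
	pid_t tgid = getpid();
	pid_t tid = (pid_t)syscall(SYS_gettid);

	assert(tid == tgid);

	/* kill(): looked up by pid. */
	r.kill_rc = kill(tgid, 0);

	/* tkill(): one specific thread, named by tid alone. */
	r.tkill_rc = (int)syscall(SYS_tkill, tid, 0);

	/* tgkill(): one specific thread, which must belong to tgid. */
	r.tgkill_rc = (int)syscall(SYS_tgkill, tgid, tid, 0);

	/* Same tid with a tgid it does not belong to: the check rejects it. */
	errno = 0;
	r.tgkill_wrong_tgid_rc = (int)syscall(SYS_tgkill, tgid + 1, tid, 0);
	r.tgkill_wrong_tgid_errno = errno;

	return r;
}
```

The last probe is the thread-group verification described above: the
thread exists, but because it is not in the named group the call fails
with ESRCH.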

> Modern glibc will not even use the original 'kill()' at all, I think.
> But it's the legacy behavior.

No.  Modern glibc definitely still uses kill.  As kill is the only one
exporting the posix kill API.  

> So I do think Oleg is right. We should be careful. You'll not notice
> breakage on a modern distro, but you might easily break old code.

Yes.  We definitely need to be careful.   At the same time since this
isn't something the old LinuxThreads had to cope with we can probably
clean it up.  But as that is not my focus it should probably be pushed out.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-16 18:01                                 ` Eric W. Biederman
@ 2018-07-16 18:40                                   ` Linus Torvalds
  2018-07-17  9:56                                   ` Oleg Nesterov
  1 sibling, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2018-07-16 18:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Mon, Jul 16, 2018 at 11:02 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> There are two questions.
> a) Can we use the pid of a thread to find the thread group?

Yes. Just find the thread, and then use p->tgid.

However, that's not what the code used to do. It used to just find the
thread, and then do "do_send_sig_info()" on it.

And it's actually *slightly* different than "find the thread group
based on the thread". At least the permission checks are different.
The permission checks are done on the thread.

> b) Will the signal be queued in the thread group?

Yes.

        pending = group ? &t->signal->shared_pending : &t->pending;

and "group" is true.

> > Now, it is possible that none of the legacy uses use CLONE_THREAD
> > and thus aren't affected (because tgid will always be pid). So maybe
> > nobody notices.
>
> That is what I expect.  I don't think legacy is a good description.
> Calling other uses of CLONE_THREAD non-glibc seems better.  The old
> LinuxThreads did not use CLONE_THREAD because it did not exist.

Again, don't get hung up about different libc implementations.

People have literally used clone() directly. And some of them use CLONE_THREAD.

Just google it. I guarantee you'll find examples of it, because I
found examples.

So stop the whole "libc" argument. That's not the point, and as long
as you make that argument, your argument is simply not valid.

People use clone() directly. Really. Really really.

            Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK to forcibly kill tasks
  2018-07-16 16:50                               ` Linus Torvalds
@ 2018-07-16 19:17                                 ` Eric W. Biederman
  2018-07-16 19:36                                   ` Linus Torvalds
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-16 19:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jul 16, 2018 at 8:08 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The change for global init is it will now die if init is a member of the
>> session or init is using this tty as its controlling tty.
>>
>> Semantically killing init with SAK is completely appropriate.
>
> No.
>
> Semantically killing init is completely wrong. Because it will kill
> the whole system.
>
> And I don't mean that in "now init won't spawn new things". I mean
> that in "now we don't have a child reaper any more, and the system
> will be dead because we'll panic on exit".
>
> So it's not about the controlling tty, it's about fundamental kernel
> internal consistency guarantees.
>
> See
>
>         write_unlock_irq(&tasklist_lock);
>         if (unlikely(pid_ns == &init_pid_ns)) {
>                 panic("Attempted to kill init! exitcode=0x%08x\n",
>                         father->signal->group_exit_code ?: father->exit_code);
>         }
>
> in kernel/exit.c.

I should have said it doesn't matter because init does not open ttys and
become a member of session groups.  Or at least it never has in my
experience.  The only way I know to get that behavior is to boot with
init=/bin/bash.

With the force_sig already in do_SAK today my change is not a
regression.  As force_sig in a completely different way forces the
signal to init.


Looking deeper, all of the silliness with SEND_SIG_FORCED and
force_sig_info is to guarantee delivery of synchronous exceptions even
to init.

So I think we want the patch below to clean that up.  Then we don't have
to worry about the wrong things sending signals to init by accident, and
SEND_SIG_FORCED becomes just SEND_SIG_PRIV that skips the unnecessary
allocation of a siginfo struct.

Thoughts?

Eric


From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Mon, 16 Jul 2018 13:29:04 -0500
Subject: [PATCH] signal: Cleanup delivery of exceptions to init

- Stop clearing SIGNAL_UNKILLABLE. It makes interaction by
  the process with other signals problematic, and exceptions
  are not necessarily fatal.

- Don't allow SIGKILL and SIGSTOP to the global init.
  It never helps and it can only make things worse.

- Explicitly allow exceptions to any kind of init.  They are
  exceptions and synchronous and need to be handled somehow.
  Init can setup a handler or deal with the default action.

  This is not a change; it is just code movement from
  force_sig_info into send_signal and get_signal.

- Treat all signals from the kernel as if they are from an ancestor
  pid namespace.

- Take out the overrides of SIGNAL_UNKILLABLE from force_sig_info
  The changes to send_signal and get_signal make them unnecessary.

- Take out the SEND_SIG_FORCED overrides from prepare_signal.
  The changes to send_signal makes it redundant.

- Rename force back to from_ancestor_ns as that makes the logic
  with respect to namespaces clearer and logically if the kernel
  is sending you a signal it is from your ancestor namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 41 ++++++++++++++++-------------------------
 1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 94296afacf44..298f5c690681 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -72,20 +72,21 @@ static int sig_handler_ignored(void __user *handler, int sig)
 		(handler == SIG_DFL && sig_kernel_ignore(sig));
 }
 
-static int sig_task_ignored(struct task_struct *t, int sig, bool force)
+static int sig_task_ignored(struct task_struct *t, int sig, bool from_ancestor_ns)
 {
 	void __user *handler;
 
 	handler = sig_handler(t, sig);
 
 	if (unlikely(t->signal->flags & SIGNAL_UNKILLABLE) &&
-	    handler == SIG_DFL && !(force && sig_kernel_only(sig)))
+	    handler == SIG_DFL &&
+	    (is_global_init(t) || !(from_ancestor_ns && sig_kernel_only(sig))))
 		return 1;
 
 	return sig_handler_ignored(handler, sig);
 }
 
-static int sig_ignored(struct task_struct *t, int sig, bool force)
+static int sig_ignored(struct task_struct *t, int sig, bool from_ancestor_ns)
 {
 	/*
 	 * Blocked signals are never ignored, since the
@@ -103,7 +104,7 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
 	if (t->ptrace && sig != SIGKILL)
 		return 0;
 
-	return sig_task_ignored(t, sig, force);
+	return sig_task_ignored(t, sig, from_ancestor_ns);
 }
 
 /*
@@ -809,7 +810,7 @@ static void ptrace_trap_notify(struct task_struct *t)
  * Returns true if the signal should be actually delivered, otherwise
  * it should be dropped.
  */
-static bool prepare_signal(int sig, struct task_struct *p, bool force)
+static bool prepare_signal(int sig, struct task_struct *p, bool from_ancestor_ns)
 {
 	struct signal_struct *signal = p->signal;
 	struct task_struct *t;
@@ -871,7 +872,7 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
 		}
 	}
 
-	return !sig_ignored(p, sig, force);
+	return !sig_ignored(p, sig, from_ancestor_ns);
 }
 
 /*
@@ -1008,8 +1009,7 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc
 	assert_spin_locked(&t->sighand->siglock);
 
 	result = TRACE_SIGNAL_IGNORED;
-	if (!prepare_signal(sig, t,
-			from_ancestor_ns || (info == SEND_SIG_FORCED)))
+	if (!prepare_signal(sig, t, from_ancestor_ns))
 		goto ret;
 
 	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
@@ -1107,12 +1107,8 @@ static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struc
 static int send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t,
 			enum pid_type type)
 {
-	int from_ancestor_ns = 0;
-
-#ifdef CONFIG_PID_NS
-	from_ancestor_ns = si_fromuser(info) &&
+	int from_ancestor_ns = !si_fromuser(info) ||
 			   !task_pid_nr_ns(current, task_active_pid_ns(t));
-#endif
 
 	return __send_signal(sig, info, t, type, from_ancestor_ns);
 }
@@ -1178,8 +1174,8 @@ int do_send_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *p
  * since we do not want to have a signal handler that was blocked
  * be invoked when user space had explicitly blocked it.
  *
- * We don't want to have recursive SIGSEGV's etc, for example,
- * that is why we also clear SIGNAL_UNKILLABLE.
+ * Exceptions are always delivered so we don't need to worry
+ * about init or other processes blocking exception signals.
  */
 int
 force_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *t)
@@ -1199,12 +1195,6 @@ force_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *t)
 			recalc_sigpending_and_wake(t);
 		}
 	}
-	/*
-	 * Don't clear SIGNAL_UNKILLABLE for traced tasks, users won't expect
-	 * debugging to leave init killable.
-	 */
-	if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
-		t->signal->flags &= ~SIGNAL_UNKILLABLE;
 	ret = send_signal(sig, info, t, PIDTYPE_PID);
 	spin_unlock_irqrestore(&t->sighand->siglock, flags);
 
@@ -2536,9 +2526,10 @@ int get_signal(struct ksignal *ksig)
 			continue;
 
 		/*
-		 * Global init gets no signals it doesn't want.
-		 * Container-init gets no signals it doesn't want from same
-		 * container.
+		 * Except for synchronous exceptions (!SI_FROMUSER)
+		 * global init gets no signals it doesn't want, and
+		 * container-init gets no signals it doesn't want from
+		 * same container.
 		 *
 		 * Note that if global/container-init sees a sig_kernel_only()
 		 * signal here, the signal must have been generated internally
@@ -2546,7 +2537,7 @@ int get_signal(struct ksignal *ksig)
 		 * case, the signal cannot be dropped.
 		 */
 		if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
-				!sig_kernel_only(signr))
+		    !sig_kernel_only(signr) && SI_FROMUSER(&ksig->info))
 			continue;
 
 		if (sig_kernel_stop(signr)) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK to forcibly kill tasks
  2018-07-16 19:17                                 ` Eric W. Biederman
@ 2018-07-16 19:36                                   ` Linus Torvalds
  2018-07-17  1:48                                     ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-16 19:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Mon, Jul 16, 2018 at 12:17 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> I should have said it doesn't matter because init does not open ttys and
> become a member of session groups.  Or at least it never has in my
> experience.  The only way I know to get that behavior is to boot with
> init=/bin/bash.

That's hopefully true, yes.

Presumably init does open the console, but hopefully doesn't do setsid.

(We *do* do "setsid()" for the linuxrc running, but that's not done by
the init thread itself).

> With the force_sig already in do_SAK today my change is not a
> regression.  As force_sig in a completely different way forces the
> signal to init.

Ok. A couple of notes in the commit description on this might be good.

> So I think we want the patch below to clean that up.  Then we don't have
> to worry about the wrong things sending signals to init by accident, and
> SEND_SIG_FORCED becomes just SEND_SIG_PRIV that skips the unnecessary
> allocation of a siginfo struct.
>
> Thoughts?

I think the end result is fine, but then I look at that patch of yours
and it does many other things and that makes me nervous.

Can you separate out the different things it does into separate
patches to make it easier to read?

           Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK to forcibly kill tasks
  2018-07-16 19:36                                   ` Linus Torvalds
@ 2018-07-17  1:48                                     ` Eric W. Biederman
  0 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-17  1:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jul 16, 2018 at 12:17 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> I should have said it doesn't matter because init does not open ttys and
>> become a member of session groups.  Or at least it never has in my
>> experience.  The only way I know to get that behavior is to boot with
>> init=/bin/bash.
>
> That's hopefully true, yes.
>
> Presumably init does open the console, but hopefully doesn't do setsid.
>
> (We *do* do "setsid()" for the linuxrc running, but that's not done by
> the init thread itself).
>
>> With the force_sig already in do_SAK today my change is not a
>> regression.  As force_sig in a completely different way forces the
>> signal to init.
>
> Ok. A couple of notes in the commit description on this might be good.

Definitely.

>> So I think we want the patch below to clean that up.  Then we don't have
>> to worry about the wrong things sending signals to init by accident, and
>> SEND_SIG_FORCED becomes just SEND_SIG_PRIV that skips the unnecessary
>> allocation of a siginfo struct.
>>
>> Thoughts?
>
> I think the end result is fine, but then I look at that patch of yours
> and it does many other things and that makes me nervous.
>
> Can you separate out the different things it does into separate
> patches to make it easier to read?

I will take a look.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-16 18:01                                 ` Eric W. Biederman
  2018-07-16 18:40                                   ` Linus Torvalds
@ 2018-07-17  9:56                                   ` Oleg Nesterov
  2018-07-17 10:18                                     ` Oleg Nesterov
  1 sibling, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-17  9:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On 07/16, Eric W. Biederman wrote:
>
> There are two questions.
> a) Can we use the pid of a thread to find the thread group?
> b) Will the signal be queued in the thread group?

IMO "yes" to both questions, I simply see no reason to change the current
semantics. Even if glibc doesn't show the thread IDs, a user can see them
in /proc/$tgid/task/. So I think kill_pid_info() should just do

	p = pid_task(pid, PIDTYPE_PID);
	group_send_sig_info(p, PIDTYPE_TGID);

again, posix_timer_event() looks fine, but to me

	pid_task(timr->it_pid, shared ? PIDTYPE_TGID : PIDTYPE_PID)

looks like unnecessary complication,

	pid_task(timr->it_pid, PIDTYPE_PID);

should do the same thing.

And, I didn't mention this yesterday, but probably the next 08/11 patch can
have the same problem. But this is a bit more complicated because send_sigio()
uses the same "type" both for do_each_pid_task() and as an argument passed to
do_send_sig_info().

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread
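
A toy userspace sketch of the semantics argued for above, not kernel code:
the lookup accepts the ID of any thread (as pid_task(pid, PIDTYPE_PID)
would), but the signal is queued on the pending set shared by the whole
thread group.  All toy_* names are hypothetical, and the bitmask stands in
for the kernel's shared pending queue.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per thread group; shared_pending is a toy signal bitmask. */
struct toy_group { int tgid; unsigned long shared_pending; };

/* One entry per thread; every thread points at its group. */
struct toy_thread { int tid; struct toy_group *group; };

/* Stand-in for pid_task(pid, PIDTYPE_PID): match any thread ID. */
static struct toy_thread *toy_pid_task(struct toy_thread *threads, size_t n,
				       int nr)
{
	for (size_t i = 0; i < n; i++)
		if (threads[i].tid == nr)
			return &threads[i];
	return NULL;
}

/*
 * Find the thread by ID, then queue the signal on its group.
 * Returns 0 on success, -1 (standing in for -ESRCH) if no thread matches.
 */
static int toy_kill_pid_info(struct toy_thread *threads, size_t n,
			     int nr, int sig)
{
	struct toy_thread *p;

	assert(sig > 0 && sig <= 32);	/* keep the shift in range */
	p = toy_pid_task(threads, n, nr);
	if (!p)
		return -1;
	p->group->shared_pending |= 1UL << (sig - 1);
	return 0;
}
```

Passing the ID of a non-leader thread still reaches the group, which is
the user-visible behavior the thread wants to preserve.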

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-17  9:56                                   ` Oleg Nesterov
@ 2018-07-17 10:18                                     ` Oleg Nesterov
  2018-07-20 23:41                                       ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-17 10:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On 07/17, Oleg Nesterov wrote:
>
> And, I didn't mention this yesterday, but probably the next 08/11 patch can
> have the same problem. But this is a bit more complicated because send_sigio()
> uses the same "type" both for do_each_pid_task() and as an argument passed to
> do_send_sig_info().

perhaps it can simply do

	if (type <= PIDTYPE_TGID) {
		rcu_read_lock();
		p = pid_task(pid, PIDTYPE_PID);
		send_sigio_to_task(p, fown, fd, band, type);
		rcu_read_unlock();
	} else {
		read_lock(&tasklist_lock);
		do_each_pid_task(pid, type, p) {
			send_sigio_to_task(p, fown, fd, band, type);
		} while_each_pid_task(pid, type, p);
		read_unlock(&tasklist_lock);
	}

this way we also avoid tasklist_lock in F_OWNER_TID/F_OWNER_PID case.

To clarify, it is not that I think any sane application can do
fcntl(F_OWNER_PID, thread_tid) but still this is a user-visible change
we can easily avoid.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread
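
The branch suggested above can be modelled in userspace as follows.  This
is only a sketch under stated assumptions: the toy_* names and the flat
task table are hypothetical, locking is omitted, and the table walk stands
in for do_each_pid_task().  The enum ordering mirrors the "type <=
PIDTYPE_TGID" test in the mail.

```c
#include <assert.h>
#include <stddef.h>

/* Single-target types sort before the multi-process ones. */
enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID };

/* Toy task: one entry per thread. */
struct toy_task {
	int tid;		/* thread ID */
	int pgid;		/* process group ID */
	int sid;		/* session ID */
	int is_leader;		/* only leaders sit on pgrp/session lists */
	int sigio_count;	/* how many times SIGIO was "sent" */
};

/* Stand-in for pid_task(pid, PIDTYPE_PID). */
static struct toy_task *toy_find_tid(struct toy_task *tasks, size_t n, int id)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].tid == id)
			return &tasks[i];
	return NULL;
}

/* Returns the number of tasks that were signalled. */
static int toy_send_sigio(struct toy_task *tasks, size_t n, int id,
			  enum pid_type type)
{
	int sent = 0;

	assert(tasks != NULL);
	if (type <= PIDTYPE_TGID) {
		/* One lookup by thread ID, no walk over the table. */
		struct toy_task *p = toy_find_tid(tasks, n, id);

		if (p) {	/* the toy model tolerates a missing task */
			p->sigio_count++;
			sent = 1;
		}
		return sent;
	}
	/* Process group or session: visit every attached leader. */
	for (size_t i = 0; i < n; i++) {
		int key = (type == PIDTYPE_PGID) ? tasks[i].pgid : tasks[i].sid;

		if (tasks[i].is_leader && key == id) {
			tasks[i].sigio_count++;
			sent++;
		}
	}
	return sent;
}
```

In the single-target branch the ID of a non-leader thread is still
accepted even when type is PIDTYPE_TGID, which is the point of the
suggestion.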

* Re: [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK  to forcibly kill tasks
  2018-07-16 15:08                             ` Eric W. Biederman
  2018-07-16 16:50                               ` Linus Torvalds
@ 2018-07-17 10:58                               ` Oleg Nesterov
  1 sibling, 0 replies; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-17 10:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/16, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 07/10, Eric W. Biederman wrote:
> >>
> >> Therefore use do_send_sig_info in all cases in __do_SAK to kill
> >> tasks as it allows for exactly what the code wants to do.
> >
> > OK, but probably the changelog should also mention that now even the global
> > init will be killed if it has this tty opened.
>
> force_sig was ensuring the global init would die.  So that isn't a
> change.  Mentioning it isn't a bad idea.

I meant another "p->signal->tty == tty" case which uses send_sig(SIGKILL).

As for force_sig(), yes it kills init, but "by accident". See your commit
20ac94378 "do_SAK: Don't recursively take the tasklist_lock", it replaced
send_sig() because it took tasklist_lock.

Nevermind, let me repeat I am not arguing with this change.

But it looks off-topic in this series, why do we need it? Yes, these
send_sig/force_sig are ugly, we need do_send_sig_info(PIDTYPE_TGID). But
__do_SAK() needs more cleanups, do_each_thread() is ugly too for the same
reason, we should not send SIGKILL per-thread. And iirc it is racy either
way, a process can open the tty right after it was checked.

I think the main loop should be rewritten as

	for_each_process(p) {
		if (p->signal->tty == tty) {
			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
				   task_pid_nr(p), p->comm);
			goto kill;
		}

		files = NULL;
		for_each_thread(p, t) {
			if (t->files == files) /* racy but we do not care */
				continue;

			task_lock(t);
			files = t->files;
			i = iterate_fd(files, 0, this_tty, tty);
			task_unlock(t);

			if (i != 0) {
				tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
					   task_pid_nr(p), p->comm, i - 1);
				goto kill;
			}
		}

		continue;
 kill:
		do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, p, true);
	}

If we want to kill inits as well, we can use SEND_SIG_FORCED and this can come
as a separate change, although I am personally fine either way.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread
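
A userspace sketch of the loop structure proposed above, under stated
assumptions: the toy_* types are hypothetical stand-ins for task_struct and
files_struct, "has_tty_fd" replaces the iterate_fd() scan, and locking is
left out.  It shows the two properties the rewrite is after: at most one
SIGKILL per process, and no rescan of a files table shared with the
previously visited thread.

```c
#include <assert.h>
#include <stddef.h>

struct toy_files { int has_tty_fd; };	/* stand-in for files_struct */

struct toy_thread { struct toy_files *files; };

struct toy_process {
	int controlling_tty;		/* 1 if signal->tty == tty */
	struct toy_thread *threads;
	size_t nr_threads;
	int kills;			/* SIGKILLs queued to this process */
};

/* Returns how many files tables had to be scanned. */
static size_t toy_sak(struct toy_process *procs, size_t n)
{
	size_t scans = 0;

	assert(procs != NULL);
	for (size_t i = 0; i < n; i++) {
		struct toy_process *p = &procs[i];
		const struct toy_files *last = NULL;
		int hit = p->controlling_tty;

		/* Stop scanning threads as soon as one reason to kill is found. */
		for (size_t j = 0; !hit && j < p->nr_threads; j++) {
			const struct toy_files *files = p->threads[j].files;

			/* Skip missing tables and the table just scanned. */
			if (files == NULL || files == last)
				continue;
			last = files;
			scans++;
			hit = files->has_tty_fd;
		}
		if (hit)
			p->kills++;	/* one signal for the whole process */
	}
	return scans;
}
```

Like the original, the comparison only remembers the last table seen, so
it is an optimization for the common case of threads sharing one table and
not a full deduplication.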

* Re: [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  2018-07-11  2:44                         ` [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
@ 2018-07-17 11:59                           ` Oleg Nesterov
  0 siblings, 0 replies; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-17 11:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

This is a bit offtopic/cosmetic and we can do this later, just for the record
before I forget this...

On 07/10, Eric W. Biederman wrote:
>
> +static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
> +{
> +	return (type == PIDTYPE_PID) ?
> +		&task->thread_pid :
> +		(type == __PIDTYPE_TGID) ?
> +		&task->signal->leader_pid :
> +		&task->signal->pids[type];
> +}

This new helper (simplified after you removed __PIDTYPE_TGID/leader_pid) can
have more users: task_pid_type(), init_task_pid(). In fact even (say) task_tgid()
can use it if we export it inline.

And if we make task_pid_ptr() the only user of signal->pids[] array we can shrink
it, signal->pids[0] is not used. Or maybe we can simply redefine enum pid_type,
we can define PIDTYPE_PID == -1 or move it to the end, or do something else.

Once again, this is just random/minor thoughts, feel free to ignore.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread
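
One way the "shrink signal->pids[]" idea could look, as a userspace sketch
only.  The toy_* structures are hypothetical, and the "type - 1" indexing
is just one of the options mentioned above (the mail equally allows
renumbering the enum instead).

```c
#include <assert.h>
#include <stddef.h>

enum pid_type {
	PIDTYPE_PID,
	PIDTYPE_TGID,
	PIDTYPE_PGID,
	PIDTYPE_SID,
	PIDTYPE_MAX,
};

struct toy_pid { int nr; };

struct toy_signal {
	/* No slot for PIDTYPE_PID: that pointer lives in the task itself. */
	struct toy_pid *pids[PIDTYPE_MAX - 1];
};

struct toy_task {
	struct toy_pid *thread_pid;
	struct toy_signal *signal;
};

/* Single accessor, so the index shift is hidden from all callers. */
static struct toy_pid **toy_task_pid_ptr(struct toy_task *task,
					 enum pid_type type)
{
	assert(type < PIDTYPE_MAX);
	return (type == PIDTYPE_PID) ?
		&task->thread_pid :
		&task->signal->pids[type - 1];
}
```

This only works if every access goes through the helper, which is why the
mail ties the shrink to making task_pid_ptr() the only user of the array.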

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-16 14:50                             ` Eric W. Biederman
  2018-07-16 17:17                               ` Linus Torvalds
@ 2018-07-17 16:38                               ` Linus Torvalds
  2018-07-20 23:27                                 ` Eric W. Biederman
  1 sibling, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-17 16:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Mon, Jul 16, 2018 at 7:50 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> In practice since glibc does not make thread id's available I don't
> expect anyone relies on this behavior.  Since no one relies on it we
> can change it without creating a regression.

Actually, there's a really obvious case where this simply isn't true.

Just imagine you're a MIS person or a developer, doing "ps -eLf" to
see what's going on, and want to kill one thread. Either because you
see that one thread using all CPU, or because you are the developer
and you know what's up.

Those thread ID's are exported trivially.

               Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-17 16:38                               ` Linus Torvalds
@ 2018-07-20 23:27                                 ` Eric W. Biederman
  0 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-20 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jul 16, 2018 at 7:50 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> In practice since glibc does not make thread id's available I don't
>> expect anyone relies on this behavior.  Since no one relies on it we
>> can change it without creating a regression.
>
> Actually, there's a really obvious case where this simply isn't true.
>
> Just imagine you're a MIS person or a developer, doing "ps -eLf" to
> see what's going on, and want to kill one thread. Either because you
> see that one thread using all CPU, or because you are the developer
> and you know what's up.
>
> Those thread ID's are exported trivially.

True.  Which makes all of this visible to shell scripts.  So someone may
have done something with this functionality.

I have just gone through all of my patches and updated them to ensure
that everything has the same behavior when selecting processes as it does
today.  So this will not be an issue with the next version of this patch series.



I am going to come back to this as there are some really nasty corner
cases in the current kernel.  Primarily that we can send signals through
a zombie thread group leader and it can have unchangeable credentials
completely out of sync with the credentials on the other threads.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID
  2018-07-17 10:18                                     ` Oleg Nesterov
@ 2018-07-20 23:41                                       ` Eric W. Biederman
  0 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-20 23:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/17, Oleg Nesterov wrote:
>>
>> And, I didn't mention this yesterday, but probably the next 08/11 patch can
>> have the same problem. But this is a bit more complicated because send_sigio()
>> uses the same "type" both for do_each_pid_task() and as an argument passed to
>> do_send_sig_info().
>
> perhaps it can simply do
>
> 	if (type <= PIDTYPE_TGID) {
> 		rcu_read_lock();
> 		p = pid_task(pid, PIDTYPE_PID);
> 		send_sigio_to_task(p, fown, fd, band, type);
> 		rcu_read_unlock();
> 	} else {
> 		read_lock(&tasklist_lock);
> 		do_each_pid_task(pid, type, p) {
> 			send_sigio_to_task(p, fown, fd, band, type);
> 		} while_each_pid_task(pid, type, p);
> 		read_unlock(&tasklist_lock);
> 	}
>
> this way we also avoid tasklist_lock in F_OWNER_TID/F_OWNER_PID case.

I like that.  I updated that code in a different way, but that looks
more elegant and I think I will incorporate it.

> To clarify, it is not that I think any sane application can do
> fcntl(F_OWNER_PID, thread_tid) but still this is a user-visible change
> we can easily avoid.

Agreed.

I do think 

Eric


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 00/20] PIDTYPE_TGID removal of fork restarts
  2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
                                           ` (10 preceding siblings ...)
  2018-07-11  2:44                         ` [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork Eric W. Biederman
@ 2018-07-24  3:22                         ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 01/20] pids: Initialize leader_pid in init_task Eric W. Biederman
                                             ` (22 more replies)
  11 siblings, 23 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang


This took longer than I thought to address all of the issues and double
check I am not missing something.  I have split off a few of the patches
so now the patch series appears longer.   It now covers less ground.

I realized while reviewing the group signals that siginfo is not
important for any of them.  Which means by slightly lowering our quality
of implementation in delivering those signals to a brand new process (by
not queueing siginfo) I can collect them all in a sigset and the code
is no more difficult than a sequence counter.  Which means it is
straightforward to completely eliminate restarts from fork.
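
A toy userspace sketch of the sigset idea, not code from the series: while
a fork is in progress, multi-process signals sent to the parent are
recorded in a set, and at the end of the fork the child takes that set as
its pending signals, so nothing has to be restarted.  Every toy_* name is
hypothetical, and only the signal numbers are kept (no siginfo), matching
the trade-off described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

struct toy_fork {
	bool in_progress;
	sigset_t delayed;	/* multi-process signals seen during the fork */
};

static void toy_fork_begin(struct toy_fork *f)
{
	f->in_progress = true;
	sigemptyset(&f->delayed);
}

/*
 * Called for every signal sent to the parent.  "multiprocess" is true for
 * signals aimed at a process group or session, false for one process.
 */
static void toy_signal_parent(struct toy_fork *f, int sig, bool multiprocess)
{
	assert(sig > 0);
	if (f->in_progress && multiprocess)
		sigaddset(&f->delayed, sig);
}

/* End of fork: the child inherits the collected set; nothing restarts. */
static void toy_fork_end(struct toy_fork *f, sigset_t *child_pending)
{
	*child_pending = f->delayed;
	f->in_progress = false;
}
```

A set cannot record how many times a signal arrived or who sent it, which
is the "slightly lower quality of implementation" accepted in exchange for
never restarting fork.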

The implementation of PIDTYPE_TGID remains the same.  How it gets used
has changed to guarantee that looking up a thread group by the pid of
one of its threads and sending it a signal continues to work exactly
the same as before.

Please take a look and verify that I have caught everything.  I think I
have but if not please let me know.

Thank you in advance,
Eric

Eric W. Biederman (20):
      pids: Initialize leader_pid in init_task
      pids: Move task_pid_type into sched/signal.h
      pids: Compute task_tgid using signal->leader_pid
      kvm: Don't open code task_pid in kvm_vcpu_ioctl
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct
      pid: Implement PIDTYPE_TGID
      signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
      posix-timers: Noralize good_sigevent
      signal: Pass pid and pid type into send_sigqueue
      signal: Pass pid type into group_send_sig_info
      signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
      signal: Pass pid type into do_send_sig_info
      signal: Push pid type down into send_signal
      signal: Push pid type down into __send_signal
      signal: Push pid type down into complete_signal.
      fork: Move and describe why the code examines PIDNS_ADDING
      fork: Unconditionally exit if a fatal signal is pending
      signal: Add calculate_sigpending()
      fork: Have new threads join on-going signal group stops
      signal: Don't restart fork when signals come in.

 arch/ia64/kernel/asm-offsets.c       |  4 +-
 arch/ia64/kernel/fsys.S              | 12 ++---
 arch/s390/kernel/perf_cpum_sf.c      |  2 +-
 drivers/net/tun.c                    |  2 +-
 drivers/platform/x86/thinkpad_acpi.c |  1 +
 drivers/tty/sysrq.c                  |  2 +-
 drivers/tty/tty_io.c                 |  2 +-
 fs/autofs/autofs_i.h                 |  1 +
 fs/exec.c                            |  1 +
 fs/fcntl.c                           | 72 +++++++++++++--------------
 fs/fuse/file.c                       |  1 +
 fs/locks.c                           |  2 +-
 fs/notify/dnotify/dnotify.c          |  3 +-
 fs/notify/fanotify/fanotify.c        |  1 +
 include/linux/init_task.h            |  9 ----
 include/linux/pid.h                  | 11 +----
 include/linux/sched.h                | 31 +++---------
 include/linux/sched/signal.h         | 49 +++++++++++++++++--
 include/linux/signal.h               |  6 ++-
 include/net/scm.h                    |  1 +
 init/init_task.c                     | 12 +++--
 kernel/events/core.c                 |  2 +-
 kernel/exit.c                        | 12 ++---
 kernel/fork.c                        | 70 +++++++++++++++++++--------
 kernel/pid.c                         | 42 ++++++++--------
 kernel/signal.c                      | 94 ++++++++++++++++++++++++++----------
 kernel/time/itimer.c                 |  5 +-
 kernel/time/posix-cpu-timers.c       |  2 +-
 kernel/time/posix-timers.c           | 21 ++++----
 mm/oom_kill.c                        |  4 +-
 virt/kvm/kvm_main.c                  |  2 +-
 31 files changed, 282 insertions(+), 197 deletions(-)

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 01/20] pids: Initialize leader_pid in init_task
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 02/20] pids: Move task_pid_type into sched/signal.h Eric W. Biederman
                                             ` (21 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This is cheap and no cost so we might as well.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 init/init_task.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/init/init_task.c b/init/init_task.c
index 74f60baa2799..7914ffb8dc73 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -33,6 +33,7 @@ static struct signal_struct init_signals = {
 	},
 #endif
 	INIT_CPU_TIMERS(init_signals)
+	.leader_pid = &init_struct_pid,
 	INIT_PREV_CPUTIME(init_signals)
 };
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 02/20] pids: Move task_pid_type into sched/signal.h
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 01/20] pids: Initialize leader_pid in init_task Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 03/20] pids: Compute task_tgid using signal->leader_pid Eric W. Biederman
                                             ` (20 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

The function is general and inline so there is no need
to hide it inside of exit.c

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h | 8 ++++++++
 kernel/exit.c                | 8 --------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 113d1ad1ced7..d8ef0a3d2e7e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -556,6 +556,14 @@ extern bool current_is_single_threaded(void);
 typedef int (*proc_visitor)(struct task_struct *p, void *data);
 void walk_process_tree(struct task_struct *top, proc_visitor, void *);
 
+static inline
+struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
+{
+	if (type != PIDTYPE_PID)
+		task = task->group_leader;
+	return task->pids[type].pid;
+}
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/kernel/exit.c b/kernel/exit.c
index c3c7ac560114..16432428fc6c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1001,14 +1001,6 @@ struct wait_opts {
 	int			notask_error;
 };
 
-static inline
-struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
-{
-	if (type != PIDTYPE_PID)
-		task = task->group_leader;
-	return task->pids[type].pid;
-}
-
 static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
 {
 	return	wo->wo_type == PIDTYPE_MAX ||
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 03/20] pids: Compute task_tgid using signal->leader_pid
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 01/20] pids: Initialize leader_pid in init_task Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 02/20] pids: Move task_pid_type into sched/signal.h Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 04/20] kvm: Don't open code task_pid in kvm_vcpu_ioctl Eric W. Biederman
                                             ` (19 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

The cost is the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.

__task_pid_nr_ns has been updated to take advantage of this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/kernel/asm-offsets.c       |  2 +-
 arch/ia64/kernel/fsys.S              |  8 ++++----
 drivers/platform/x86/thinkpad_acpi.c |  1 +
 fs/fuse/file.c                       |  1 +
 fs/notify/fanotify/fanotify.c        |  1 +
 include/linux/sched.h                |  5 -----
 include/linux/sched/signal.h         |  5 +++++
 include/net/scm.h                    |  1 +
 kernel/pid.c                         | 15 ++++++++-------
 9 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index f4db2168d1b8..f5433bb7f04a 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -50,7 +50,6 @@ void foo(void)
 
 	DEFINE(IA64_TASK_BLOCKED_OFFSET,offsetof (struct task_struct, blocked));
 	DEFINE(IA64_TASK_CLEAR_CHILD_TID_OFFSET,offsetof (struct task_struct, clear_child_tid));
-	DEFINE(IA64_TASK_GROUP_LEADER_OFFSET, offsetof (struct task_struct, group_leader));
 	DEFINE(IA64_TASK_TGIDLINK_OFFSET, offsetof (struct task_struct, pids[PIDTYPE_PID].pid));
 	DEFINE(IA64_PID_LEVEL_OFFSET, offsetof (struct pid, level));
 	DEFINE(IA64_PID_UPID_OFFSET, offsetof (struct pid, numbers[0]));
@@ -68,6 +67,7 @@ void foo(void)
 	DEFINE(IA64_SIGNAL_GROUP_STOP_COUNT_OFFSET,offsetof (struct signal_struct,
 							     group_stop_count));
 	DEFINE(IA64_SIGNAL_SHARED_PENDING_OFFSET,offsetof (struct signal_struct, shared_pending));
+	DEFINE(IA64_SIGNAL_LEADER_PID_OFFSET, offsetof (struct signal_struct, leader_pid));
 
 	BLANK();
 
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index fe742ffafc7a..eaf5a0d6f3e0 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -62,16 +62,16 @@ ENTRY(fsys_getpid)
 	.prologue
 	.altrp b6
 	.body
-	add r17=IA64_TASK_GROUP_LEADER_OFFSET,r16
+	add r17=IA64_TASK_SIGNAL_OFFSET,r16
 	;;
-	ld8 r17=[r17]				// r17 = current->group_leader
+	ld8 r17=[r17]				// r17 = current->signal
 	add r9=TI_FLAGS+IA64_TASK_SIZE,r16
 	;;
 	ld4 r9=[r9]
-	add r17=IA64_TASK_TGIDLINK_OFFSET,r17
+	add r17=IA64_SIGNAL_LEADER_PID_OFFSET,r17
 	;;
 	and r9=TIF_ALLWORK_MASK,r9
-	ld8 r17=[r17]				// r17 = current->group_leader->pids[PIDTYPE_PID].pid
+	ld8 r17=[r17]				// r17 = current->signal->leader_pid
 	;;
 	add r8=IA64_PID_LEVEL_OFFSET,r17
 	;;
diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c
index cae9b0595692..d556e95c532c 100644
--- a/drivers/platform/x86/thinkpad_acpi.c
+++ b/drivers/platform/x86/thinkpad_acpi.c
@@ -57,6 +57,7 @@
 #include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/delay.h>
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a201fb0ac64f..b00a3f126a89 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -12,6 +12,7 @@
 #include <linux/slab.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/swap.h>
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f90842efea13..6e828cb82e5e 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -8,6 +8,7 @@
 #include <linux/mount.h>
 #include <linux/sched.h>
 #include <linux/sched/user.h>
+#include <linux/sched/signal.h>
 #include <linux/types.h>
 #include <linux/wait.h>
 #include <linux/audit.h>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..a461ff89a3af 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1202,11 +1202,6 @@ static inline struct pid *task_pid(struct task_struct *task)
 	return task->pids[PIDTYPE_PID].pid;
 }
 
-static inline struct pid *task_tgid(struct task_struct *task)
-{
-	return task->group_leader->pids[PIDTYPE_PID].pid;
-}
-
 /*
  * Without tasklist or RCU lock it is not safe to dereference
  * the result of task_pgrp/task_session even if task == current,
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d8ef0a3d2e7e..b95a272c1ab5 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -564,6 +564,11 @@ struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
 	return task->pids[type].pid;
 }
 
+static inline struct pid *task_tgid(struct task_struct *task)
+{
+	return task->signal->leader_pid;
+}
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/include/net/scm.h b/include/net/scm.h
index 903771c8d4e3..1ce365f4c256 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -8,6 +8,7 @@
 #include <linux/security.h>
 #include <linux/pid.h>
 #include <linux/nsproxy.h>
+#include <linux/sched/signal.h>
 
 /* Well, we should have at least one descriptor open
  * to accept passed FDs 8)
diff --git a/kernel/pid.c b/kernel/pid.c
index 157fe4b19971..d0de2b59f86f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -421,13 +421,14 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 	if (!ns)
 		ns = task_active_pid_ns(current);
 	if (likely(pid_alive(task))) {
-		if (type != PIDTYPE_PID) {
-			if (type == __PIDTYPE_TGID)
-				type = PIDTYPE_PID;
-
-			task = task->group_leader;
-		}
-		nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
+		struct pid *pid;
+		if (type == PIDTYPE_PID)
+			pid = task_pid(task);
+		else if (type == __PIDTYPE_TGID)
+			pid = task_tgid(task);
+		else
+			pid = rcu_dereference(task->group_leader->pids[type].pid);
+		nr = pid_nr_ns(pid, ns);
 	}
 	rcu_read_unlock();
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 04/20] kvm: Don't open code task_pid in kvm_vcpu_ioctl
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (2 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 03/20] pids: Compute task_tgid using signal->leader_pid Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 05/20] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
                                             ` (18 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..4c593acc4510 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2560,7 +2560,7 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		if (arg)
 			goto out;
 		oldpid = rcu_access_pointer(vcpu->pid);
-		if (unlikely(oldpid != current->pids[PIDTYPE_PID].pid)) {
+		if (unlikely(oldpid != task_pid(current))) {
 			/* The thread running this VCPU changed. */
 			struct pid *newpid;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 05/20] pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (3 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 04/20] kvm: Don't open code task_pid in kvm_vcpu_ioctl Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 06/20] pid: Implement PIDTYPE_TGID Eric W. Biederman
                                             ` (17 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

To access these fields the code always has to go to the group leader, so
going to signal_struct is no loss and is actually a fundamental simplification.

This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct cannot.
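
As a rough illustration of the layout described above, here is a
userspace sketch.  It is not kernel code: the structures are cut-down
stand-ins that only keep the fields discussed here, and set_pgrp() is a
hypothetical helper invented for the example.

```c
/* Userspace model only: simplified stand-ins for the kernel structures. */
#include <stddef.h>

enum pid_type { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

struct pid { int nr; };

/* One instance per process, shared by every thread in it. */
struct signal_struct {
	struct pid *pids[PIDTYPE_MAX];
};

struct task_struct {
	struct pid *thread_pid;		/* per-thread pid */
	struct signal_struct *signal;	/* per-process state, shared */
};

/* Only the thread pid lives in the task; everything else is per-process. */
static struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
{
	if (type == PIDTYPE_PID)
		return task->thread_pid;
	return task->signal->pids[type];
}

/* Hypothetical helper: a single store that every thread observes. */
static void set_pgrp(struct task_struct *task, struct pid *pgrp)
{
	task->signal->pids[PIDTYPE_PGID] = pgrp;
}
```

In this model no lookup goes through a group_leader pointer, so a
change of leader cannot affect what task_pid_type() returns for the
process-wide types.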

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/kernel/asm-offsets.c |  2 +-
 arch/ia64/kernel/fsys.S        |  4 +--
 fs/autofs/autofs_i.h           |  1 +
 include/linux/init_task.h      |  9 -------
 include/linux/pid.h            |  8 +-----
 include/linux/sched.h          | 22 +++--------------
 include/linux/sched/signal.h   | 26 +++++++++++++++++---
 init/init_task.c               | 11 +++++----
 kernel/fork.c                  | 23 +++++++++++++----
 kernel/pid.c                   | 45 +++++++++++++++++-----------------
 10 files changed, 78 insertions(+), 73 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index f5433bb7f04a..c1f8a57855af 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -50,7 +50,7 @@ void foo(void)
 
 	DEFINE(IA64_TASK_BLOCKED_OFFSET,offsetof (struct task_struct, blocked));
 	DEFINE(IA64_TASK_CLEAR_CHILD_TID_OFFSET,offsetof (struct task_struct, clear_child_tid));
-	DEFINE(IA64_TASK_TGIDLINK_OFFSET, offsetof (struct task_struct, pids[PIDTYPE_PID].pid));
+	DEFINE(IA64_TASK_THREAD_PID_OFFSET,offsetof (struct task_struct, thread_pid));
 	DEFINE(IA64_PID_LEVEL_OFFSET, offsetof (struct pid, level));
 	DEFINE(IA64_PID_UPID_OFFSET, offsetof (struct pid, numbers[0]));
 	DEFINE(IA64_TASK_PENDING_OFFSET,offsetof (struct task_struct, pending));
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index eaf5a0d6f3e0..e85ebdac678b 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -96,11 +96,11 @@ ENTRY(fsys_set_tid_address)
 	.altrp b6
 	.body
 	add r9=TI_FLAGS+IA64_TASK_SIZE,r16
-	add r17=IA64_TASK_TGIDLINK_OFFSET,r16
+	add r17=IA64_TASK_THREAD_PID_OFFSET,r16
 	;;
 	ld4 r9=[r9]
 	tnat.z p6,p7=r32		// check argument register for being NaT
-	ld8 r17=[r17]				// r17 = current->pids[PIDTYPE_PID].pid
+	ld8 r17=[r17]				// r17 = current->thread_pid
 	;;
 	and r9=TIF_ALLWORK_MASK,r9
 	add r8=IA64_PID_LEVEL_OFFSET,r17
diff --git a/fs/autofs/autofs_i.h b/fs/autofs/autofs_i.h
index 9400a9f6318a..502812289850 100644
--- a/fs/autofs/autofs_i.h
+++ b/fs/autofs/autofs_i.h
@@ -18,6 +18,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/uaccess.h>
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a454b8aeb938..a7083a45a26c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -46,15 +46,6 @@ extern struct cred init_cred;
 #define INIT_CPU_TIMERS(s)
 #endif
 
-#define INIT_PID_LINK(type) 					\
-{								\
-	.node = {						\
-		.next = NULL,					\
-		.pprev = NULL,					\
-	},							\
-	.pid = &init_struct_pid,				\
-}
-
 #define INIT_TASK_COMM "swapper"
 
 /* Attach to the init_task data structure for proper alignment */
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7633d55d9a24..3d4c504dcc8c 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,12 +67,6 @@ struct pid
 
 extern struct pid init_struct_pid;
 
-struct pid_link
-{
-	struct hlist_node node;
-	struct pid *pid;
-};
-
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
@@ -177,7 +171,7 @@ pid_t pid_vnr(struct pid *pid);
 	do {								\
 		if ((pid) != NULL)					\
 			hlist_for_each_entry_rcu((task),		\
-				&(pid)->tasks[type], pids[type].node) {
+				&(pid)->tasks[type], pid_links[type]) {
 
 			/*
 			 * Both old and new leaders may be attached to
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a461ff89a3af..445bdf5b1f64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,8 @@ struct task_struct {
 	struct list_head		ptrace_entry;
 
 	/* PID/PID hash table linkage. */
-	struct pid_link			pids[PIDTYPE_MAX];
+	struct pid			*thread_pid;
+	struct hlist_node		pid_links[PIDTYPE_MAX];
 	struct list_head		thread_group;
 	struct list_head		thread_node;
 
@@ -1199,22 +1200,7 @@ struct task_struct {
 
 static inline struct pid *task_pid(struct task_struct *task)
 {
-	return task->pids[PIDTYPE_PID].pid;
-}
-
-/*
- * Without tasklist or RCU lock it is not safe to dereference
- * the result of task_pgrp/task_session even if task == current,
- * we can race with another thread doing sys_setsid/sys_setpgid.
- */
-static inline struct pid *task_pgrp(struct task_struct *task)
-{
-	return task->group_leader->pids[PIDTYPE_PGID].pid;
-}
-
-static inline struct pid *task_session(struct task_struct *task)
-{
-	return task->group_leader->pids[PIDTYPE_SID].pid;
+	return task->thread_pid;
 }
 
 /*
@@ -1263,7 +1249,7 @@ static inline pid_t task_tgid_nr(struct task_struct *tsk)
  */
 static inline int pid_alive(const struct task_struct *p)
 {
-	return p->pids[PIDTYPE_PID].pid != NULL;
+	return p->thread_pid != NULL;
 }
 
 static inline pid_t task_pgrp_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b95a272c1ab5..2dcded16eb1e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -146,7 +146,9 @@ struct signal_struct {
 
 #endif
 
+	/* PID/PID hash table linkage. */
 	struct pid *leader_pid;
+	struct pid *pids[PIDTYPE_MAX];
 
 #ifdef CONFIG_NO_HZ_FULL
 	atomic_t tick_dep_mask;
@@ -559,9 +561,12 @@ void walk_process_tree(struct task_struct *top, proc_visitor, void *);
 static inline
 struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
 {
-	if (type != PIDTYPE_PID)
-		task = task->group_leader;
-	return task->pids[type].pid;
+	struct pid *pid;
+	if (type == PIDTYPE_PID)
+		pid = task_pid(task);
+	else
+		pid = task->signal->pids[type];
+	return pid;
 }
 
 static inline struct pid *task_tgid(struct task_struct *task)
@@ -569,6 +574,21 @@ static inline struct pid *task_tgid(struct task_struct *task)
 	return task->signal->leader_pid;
 }
 
+/*
+ * Without tasklist or RCU lock it is not safe to dereference
+ * the result of task_pgrp/task_session even if task == current,
+ * we can race with another thread doing sys_setsid/sys_setpgid.
+ */
+static inline struct pid *task_pgrp(struct task_struct *task)
+{
+	return task->signal->pids[PIDTYPE_PGID];
+}
+
+static inline struct pid *task_session(struct task_struct *task)
+{
+	return task->signal->pids[PIDTYPE_SID];
+}
+
 static inline int get_nr_threads(struct task_struct *tsk)
 {
 	return tsk->signal->nr_threads;
diff --git a/init/init_task.c b/init/init_task.c
index 7914ffb8dc73..db12a61259f1 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -34,6 +34,11 @@ static struct signal_struct init_signals = {
 #endif
 	INIT_CPU_TIMERS(init_signals)
 	.leader_pid = &init_struct_pid,
+	.pids = {
+		[PIDTYPE_PID]	= &init_struct_pid,
+		[PIDTYPE_PGID]	= &init_struct_pid,
+		[PIDTYPE_SID]	= &init_struct_pid,
+	},
 	INIT_PREV_CPUTIME(init_signals)
 };
 
@@ -112,11 +117,7 @@ struct task_struct init_task
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
-	.pids = {
-		[PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),
-		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),
-		[PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),
-	},
+	.thread_pid	= &init_struct_pid,
 	.thread_group	= LIST_HEAD_INIT(init_task.thread_group),
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..d2952162399b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1549,10 +1549,22 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
 static inline void posix_cpu_timers_init(struct task_struct *tsk) { }
 #endif
 
+static inline void init_task_pid_links(struct task_struct *task)
+{
+	enum pid_type type;
+
+	for (type = PIDTYPE_PID; type < PIDTYPE_MAX; ++type) {
+		INIT_HLIST_NODE(&task->pid_links[type]);
+	}
+}
+
 static inline void
 init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 {
-	 task->pids[type].pid = pid;
+	if (type == PIDTYPE_PID)
+		task->thread_pid = pid;
+	else
+		task->signal->pids[type] = pid;
 }
 
 static inline void rcu_copy_process(struct task_struct *p)
@@ -1928,6 +1940,7 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	init_task_pid_links(p);
 	if (likely(p->pid)) {
 		ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
 
@@ -2036,13 +2049,13 @@ static __latent_entropy struct task_struct *copy_process(
 	return ERR_PTR(retval);
 }
 
-static inline void init_idle_pids(struct pid_link *links)
+static inline void init_idle_pids(struct task_struct *idle)
 {
 	enum pid_type type;
 
 	for (type = PIDTYPE_PID; type < PIDTYPE_MAX; ++type) {
-		INIT_HLIST_NODE(&links[type].node); /* not really needed */
-		links[type].pid = &init_struct_pid;
+		INIT_HLIST_NODE(&idle->pid_links[type]); /* not really needed */
+		init_task_pid(idle, type, &init_struct_pid);
 	}
 }
 
@@ -2052,7 +2065,7 @@ struct task_struct *fork_idle(int cpu)
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
 			    cpu_to_node(cpu));
 	if (!IS_ERR(task)) {
-		init_idle_pids(task->pids);
+		init_idle_pids(task);
 		init_idle(task, cpu);
 	}
 
diff --git a/kernel/pid.c b/kernel/pid.c
index d0de2b59f86f..f8486d2e2346 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -265,27 +265,35 @@ struct pid *find_vpid(int nr)
 }
 EXPORT_SYMBOL_GPL(find_vpid);
 
+static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
+{
+	return (type == PIDTYPE_PID) ?
+		&task->thread_pid :
+		(type == __PIDTYPE_TGID) ?
+		&task->signal->leader_pid :
+		&task->signal->pids[type];
+}
+
 /*
  * attach_pid() must be called with the tasklist_lock write-held.
  */
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
-	struct pid_link *link = &task->pids[type];
-	hlist_add_head_rcu(&link->node, &link->pid->tasks[type]);
+	struct pid *pid = *task_pid_ptr(task, type);
+	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
 static void __change_pid(struct task_struct *task, enum pid_type type,
 			struct pid *new)
 {
-	struct pid_link *link;
+	struct pid **pid_ptr = task_pid_ptr(task, type);
 	struct pid *pid;
 	int tmp;
 
-	link = &task->pids[type];
-	pid = link->pid;
+	pid = *pid_ptr;
 
-	hlist_del_rcu(&link->node);
-	link->pid = new;
+	hlist_del_rcu(&task->pid_links[type]);
+	*pid_ptr = new;
 
 	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
 		if (!hlist_empty(&pid->tasks[tmp]))
@@ -310,8 +318,9 @@ void change_pid(struct task_struct *task, enum pid_type type,
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
 {
-	new->pids[type].pid = old->pids[type].pid;
-	hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
+	if (type == PIDTYPE_PID)
+		new->thread_pid = old->thread_pid;
+	hlist_replace_rcu(&old->pid_links[type], &new->pid_links[type]);
 }
 
 struct task_struct *pid_task(struct pid *pid, enum pid_type type)
@@ -322,7 +331,7 @@ struct task_struct *pid_task(struct pid *pid, enum pid_type type)
 		first = rcu_dereference_check(hlist_first_rcu(&pid->tasks[type]),
 					      lockdep_tasklist_lock_is_held());
 		if (first)
-			result = hlist_entry(first, struct task_struct, pids[(type)].node);
+			result = hlist_entry(first, struct task_struct, pid_links[(type)]);
 	}
 	return result;
 }
@@ -360,9 +369,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid;
 	rcu_read_lock();
-	if (type != PIDTYPE_PID)
-		task = task->group_leader;
-	pid = get_pid(rcu_dereference(task->pids[type].pid));
+	pid = get_pid(rcu_dereference(*task_pid_ptr(task, type)));
 	rcu_read_unlock();
 	return pid;
 }
@@ -420,16 +427,8 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 	rcu_read_lock();
 	if (!ns)
 		ns = task_active_pid_ns(current);
-	if (likely(pid_alive(task))) {
-		struct pid *pid;
-		if (type == PIDTYPE_PID)
-			pid = task_pid(task);
-		else if (type == __PIDTYPE_TGID)
-			pid = task_tgid(task);
-		else
-			pid = rcu_dereference(task->group_leader->pids[type].pid);
-		nr = pid_nr_ns(pid, ns);
-	}
+	if (likely(pid_alive(task)))
+		nr = pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);
 	rcu_read_unlock();
 
 	return nr;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 06/20] pid: Implement PIDTYPE_TGID
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (4 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 05/20] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 07/20] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
                                             ` (16 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Everywhere except in the pid array we distinguish between a task's pid and
a task's tgid (thread group id).  Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array.  Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.

The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.

The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to its fullest.  The long-term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.
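
A small userspace sketch of what the description above amounts to,
under the assumption that the structures can be reduced to the fields
shown (they are stand-ins, not the kernel definitions).  With
PIDTYPE_TGID in the enumeration, the tgid is just another slot in the
per-process array and the pointer lookup needs only one special case.

```c
/* Userspace model only: simplified stand-ins for the kernel structures. */
#include <stdbool.h>
#include <stddef.h>

enum pid_type {
	PIDTYPE_PID,
	PIDTYPE_TGID,	/* first-class member, no __PIDTYPE_TGID */
	PIDTYPE_PGID,
	PIDTYPE_SID,
	PIDTYPE_MAX,
};

struct pid { int nr; };
struct signal_struct { struct pid *pids[PIDTYPE_MAX]; };
struct task_struct {
	struct pid *thread_pid;
	struct signal_struct *signal;
};

/* The only remaining special case is the per-thread pid. */
static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
{
	return (type == PIDTYPE_PID) ?
		&task->thread_pid :
		&task->signal->pids[type];
}

static struct pid *task_tgid(struct task_struct *task)
{
	return *task_pid_ptr(task, PIDTYPE_TGID);
}

/* True when this thread's own pid is the thread group id. */
static bool has_group_leader_pid(struct task_struct *task)
{
	return task->thread_pid == task_tgid(task);
}
```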

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/kernel/asm-offsets.c  | 2 +-
 arch/ia64/kernel/fsys.S         | 4 ++--
 arch/s390/kernel/perf_cpum_sf.c | 2 +-
 fs/exec.c                       | 1 +
 include/linux/pid.h             | 3 +--
 include/linux/sched.h           | 4 ++--
 include/linux/sched/signal.h    | 5 ++---
 init/init_task.c                | 2 +-
 kernel/events/core.c            | 2 +-
 kernel/exit.c                   | 1 +
 kernel/fork.c                   | 3 ++-
 kernel/pid.c                    | 2 --
 kernel/time/itimer.c            | 5 +++--
 kernel/time/posix-cpu-timers.c  | 2 +-
 14 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/kernel/asm-offsets.c b/arch/ia64/kernel/asm-offsets.c
index c1f8a57855af..00e8e2a1eb19 100644
--- a/arch/ia64/kernel/asm-offsets.c
+++ b/arch/ia64/kernel/asm-offsets.c
@@ -67,7 +67,7 @@ void foo(void)
 	DEFINE(IA64_SIGNAL_GROUP_STOP_COUNT_OFFSET,offsetof (struct signal_struct,
 							     group_stop_count));
 	DEFINE(IA64_SIGNAL_SHARED_PENDING_OFFSET,offsetof (struct signal_struct, shared_pending));
-	DEFINE(IA64_SIGNAL_LEADER_PID_OFFSET, offsetof (struct signal_struct, leader_pid));
+	DEFINE(IA64_SIGNAL_PIDS_TGID_OFFSET, offsetof (struct signal_struct, pids[PIDTYPE_TGID]));
 
 	BLANK();
 
diff --git a/arch/ia64/kernel/fsys.S b/arch/ia64/kernel/fsys.S
index e85ebdac678b..d80c99a5f55d 100644
--- a/arch/ia64/kernel/fsys.S
+++ b/arch/ia64/kernel/fsys.S
@@ -68,10 +68,10 @@ ENTRY(fsys_getpid)
 	add r9=TI_FLAGS+IA64_TASK_SIZE,r16
 	;;
 	ld4 r9=[r9]
-	add r17=IA64_SIGNAL_LEADER_PID_OFFSET,r17
+	add r17=IA64_SIGNAL_PIDS_TGID_OFFSET,r17
 	;;
 	and r9=TIF_ALLWORK_MASK,r9
-	ld8 r17=[r17]				// r17 = current->signal->leader_pid
+	ld8 r17=[r17]				// r17 = current->signal->pids[PIDTYPE_TGID]
 	;;
 	add r8=IA64_PID_LEVEL_OFFSET,r17
 	;;
diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index 0292d68e7dde..ca0b7ae894bb 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -665,7 +665,7 @@ static void cpumsf_output_event_pid(struct perf_event *event,
 		goto out;
 
 	/* Update the process ID (see also kernel/events/core.c) */
-	data->tid_entry.pid = cpumsf_pid_type(event, pid, __PIDTYPE_TGID);
+	data->tid_entry.pid = cpumsf_pid_type(event, pid, PIDTYPE_TGID);
 	data->tid_entry.tid = cpumsf_pid_type(event, pid, PIDTYPE_PID);
 
 	perf_output_sample(&handle, &header, data, event);
diff --git a/fs/exec.c b/fs/exec.c
index 2d4e0075bd24..79a11fbded7a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1146,6 +1146,7 @@ static int de_thread(struct task_struct *tsk)
 		 */
 		tsk->pid = leader->pid;
 		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3d4c504dcc8c..14a9a39da9c7 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -7,11 +7,10 @@
 enum pid_type
 {
 	PIDTYPE_PID,
+	PIDTYPE_TGID,
 	PIDTYPE_PGID,
 	PIDTYPE_SID,
 	PIDTYPE_MAX,
-	/* only valid to __task_pid_nr_ns() */
-	__PIDTYPE_TGID
 };
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 445bdf5b1f64..06b4e3bda93a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,12 +1275,12 @@ static inline pid_t task_session_vnr(struct task_struct *tsk)
 
 static inline pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
 {
-	return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, ns);
+	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
 }
 
 static inline pid_t task_tgid_vnr(struct task_struct *tsk)
 {
-	return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, NULL);
+	return __task_pid_nr_ns(tsk, PIDTYPE_TGID, NULL);
 }
 
 static inline pid_t task_ppid_nr_ns(const struct task_struct *tsk, struct pid_namespace *ns)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2dcded16eb1e..ee30a5ba475f 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -147,7 +147,6 @@ struct signal_struct {
 #endif
 
 	/* PID/PID hash table linkage. */
-	struct pid *leader_pid;
 	struct pid *pids[PIDTYPE_MAX];
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -571,7 +570,7 @@ struct pid *task_pid_type(struct task_struct *task, enum pid_type type)
 
 static inline struct pid *task_tgid(struct task_struct *task)
 {
-	return task->signal->leader_pid;
+	return task->signal->pids[PIDTYPE_TGID];
 }
 
 /*
@@ -607,7 +606,7 @@ static inline bool thread_group_leader(struct task_struct *p)
  */
 static inline bool has_group_leader_pid(struct task_struct *p)
 {
-	return task_pid(p) == p->signal->leader_pid;
+	return task_pid(p) == task_tgid(p);
 }
 
 static inline
diff --git a/init/init_task.c b/init/init_task.c
index db12a61259f1..4f97846256d7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -33,9 +33,9 @@ static struct signal_struct init_signals = {
 	},
 #endif
 	INIT_CPU_TIMERS(init_signals)
-	.leader_pid = &init_struct_pid,
 	.pids = {
 		[PIDTYPE_PID]	= &init_struct_pid,
+		[PIDTYPE_TGID]	= &init_struct_pid,
 		[PIDTYPE_PGID]	= &init_struct_pid,
 		[PIDTYPE_SID]	= &init_struct_pid,
 	},
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80cca2b30c4f..9025b1796ca8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1334,7 +1334,7 @@ static u32 perf_event_pid_type(struct perf_event *event, struct task_struct *p,
 
 static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
 {
-	return perf_event_pid_type(event, p, __PIDTYPE_TGID);
+	return perf_event_pid_type(event, p, PIDTYPE_TGID);
 }
 
 static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
diff --git a/kernel/exit.c b/kernel/exit.c
index 16432428fc6c..25582b442955 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -73,6 +73,7 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
 	nr_threads--;
 	detach_pid(p, PIDTYPE_PID);
 	if (group_dead) {
+		detach_pid(p, PIDTYPE_TGID);
 		detach_pid(p, PIDTYPE_PGID);
 		detach_pid(p, PIDTYPE_SID);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index d2952162399b..cc5be0d01ce6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1946,6 +1946,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 		init_task_pid(p, PIDTYPE_PID, pid);
 		if (thread_group_leader(p)) {
+			init_task_pid(p, PIDTYPE_TGID, pid);
 			init_task_pid(p, PIDTYPE_PGID, task_pgrp(current));
 			init_task_pid(p, PIDTYPE_SID, task_session(current));
 
@@ -1954,7 +1955,6 @@ static __latent_entropy struct task_struct *copy_process(
 				p->signal->flags |= SIGNAL_UNKILLABLE;
 			}
 
-			p->signal->leader_pid = pid;
 			p->signal->tty = tty_kref_get(current->signal->tty);
 			/*
 			 * Inherit has_child_subreaper flag under the same
@@ -1965,6 +1965,7 @@ static __latent_entropy struct task_struct *copy_process(
 							 p->real_parent->signal->is_child_subreaper;
 			list_add_tail(&p->sibling, &p->real_parent->children);
 			list_add_tail_rcu(&p->tasks, &init_task.tasks);
+			attach_pid(p, PIDTYPE_TGID);
 			attach_pid(p, PIDTYPE_PGID);
 			attach_pid(p, PIDTYPE_SID);
 			__this_cpu_inc(process_counts);
diff --git a/kernel/pid.c b/kernel/pid.c
index f8486d2e2346..de1cfc4f75a2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -269,8 +269,6 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
 {
 	return (type == PIDTYPE_PID) ?
 		&task->thread_pid :
-		(type == __PIDTYPE_TGID) ?
-		&task->signal->leader_pid :
 		&task->signal->pids[type];
 }
 
diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
index f26acef5d7b4..9a65713c8309 100644
--- a/kernel/time/itimer.c
+++ b/kernel/time/itimer.c
@@ -139,9 +139,10 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
 {
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
+	struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
 
-	trace_itimer_expire(ITIMER_REAL, sig->leader_pid, 0);
-	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
+	trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
+	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
 
 	return HRTIMER_NORESTART;
 }
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 5a6251ac6f7a..40e6fae46cec 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -895,7 +895,7 @@ static void check_cpu_itimer(struct task_struct *tsk, struct cpu_itimer *it,
 
 		trace_itimer_expire(signo == SIGPROF ?
 				    ITIMER_PROF : ITIMER_VIRTUAL,
-				    tsk->signal->leader_pid, cur_time);
+				    task_tgid(tsk), cur_time);
 		__group_send_sig_info(signo, SEND_SIG_PRIV, tsk);
 	}
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 07/20] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (5 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 06/20] pid: Implement PIDTYPE_TGID Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-08-16  4:04                             ` [PATCH] signal: Don't send signals to tasks that don't exist Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 08/20] posix-timers: Noralize good_sigevent Eric W. Biederman
                                             ` (15 subsequent siblings)
  22 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

When f_setown is called a pid and a pid type are stored.  Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group.  Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.

Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.

For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.

Oleg Nesterov suggested that the test which ensures we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.
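
The mapping and the lock test described above can be sketched in
userspace as follows.  This is an illustration only: enum owner_kind
and the three helper functions are hypothetical names standing in for
the F_OWNER_* handling in fs/fcntl.c, not functions from the patch.

```c
/* Userspace model only: hypothetical helpers, not kernel code. */
#include <stdbool.h>

enum pid_type {
	PIDTYPE_PID,
	PIDTYPE_TGID,
	PIDTYPE_PGID,
	PIDTYPE_SID,
	PIDTYPE_MAX,
};

/* Hypothetical stand-ins for F_OWNER_TID, F_OWNER_PID and F_OWNER_PGRP. */
enum owner_kind { OWNER_TID, OWNER_PID, OWNER_PGRP };

/* A thread owner is PIDTYPE_PID; a process owner is PIDTYPE_TGID. */
static enum pid_type owner_to_pid_type(enum owner_kind kind)
{
	switch (kind) {
	case OWNER_TID:
		return PIDTYPE_PID;
	case OWNER_PID:
		return PIDTYPE_TGID;
	default:
		return PIDTYPE_PGID;
	}
}

/* Only a PIDTYPE_PID owner gets a thread-directed signal (group = 0). */
static bool signal_is_group_wide(enum pid_type type)
{
	return type != PIDTYPE_PID;
}

/*
 * PID and TGID owners name exactly one task, so a single lookup under
 * RCU is enough; only the multi-process types walk a list and need the
 * tasklist lock.
 */
static bool needs_tasklist_lock(enum pid_type type)
{
	return type > PIDTYPE_TGID;
}
```

The comparison relies on PIDTYPE_PID and PIDTYPE_TGID being the first
two enumerators, which matches the order introduced in patch 06.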

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/tun.c           |  2 +-
 drivers/tty/tty_io.c        |  2 +-
 fs/fcntl.c                  | 54 ++++++++++++++++++++++---------------
 fs/locks.c                  |  2 +-
 fs/notify/dnotify/dnotify.c |  3 ++-
 5 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a192a017cc68..9958b70ac1b0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3216,7 +3216,7 @@ static int tun_chr_fasync(int fd, struct file *file, int on)
 		goto out;
 
 	if (on) {
-		__f_setown(file, task_pid(current), PIDTYPE_PID, 0);
+		__f_setown(file, task_pid(current), PIDTYPE_TGID, 0);
 		tfile->flags |= TUN_FASYNC;
 	} else
 		tfile->flags &= ~TUN_FASYNC;
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index aba59521ad48..090fb7e78eea 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2122,7 +2122,7 @@ static int __tty_fasync(int fd, struct file *filp, int on)
 			type = PIDTYPE_PGID;
 		} else {
 			pid = task_pid(current);
-			type = PIDTYPE_PID;
+			type = PIDTYPE_TGID;
 		}
 		get_pid(pid);
 		spin_unlock_irqrestore(&tty->ctrl_lock, flags);
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 12273b6ea56d..1523588fd759 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -116,7 +116,7 @@ int f_setown(struct file *filp, unsigned long arg, int force)
 	struct pid *pid = NULL;
 	int who = arg, ret = 0;
 
-	type = PIDTYPE_PID;
+	type = PIDTYPE_TGID;
 	if (who < 0) {
 		/* avoid overflow below */
 		if (who == INT_MIN)
@@ -143,7 +143,7 @@ EXPORT_SYMBOL(f_setown);
 
 void f_delown(struct file *filp)
 {
-	f_modown(filp, NULL, PIDTYPE_PID, 1);
+	f_modown(filp, NULL, PIDTYPE_TGID, 1);
 }
 
 pid_t f_getown(struct file *filp)
@@ -171,11 +171,11 @@ static int f_setown_ex(struct file *filp, unsigned long arg)
 
 	switch (owner.type) {
 	case F_OWNER_TID:
-		type = PIDTYPE_MAX;
+		type = PIDTYPE_PID;
 		break;
 
 	case F_OWNER_PID:
-		type = PIDTYPE_PID;
+		type = PIDTYPE_TGID;
 		break;
 
 	case F_OWNER_PGRP:
@@ -206,11 +206,11 @@ static int f_getown_ex(struct file *filp, unsigned long arg)
 	read_lock(&filp->f_owner.lock);
 	owner.pid = pid_vnr(filp->f_owner.pid);
 	switch (filp->f_owner.pid_type) {
-	case PIDTYPE_MAX:
+	case PIDTYPE_PID:
 		owner.type = F_OWNER_TID;
 		break;
 
-	case PIDTYPE_PID:
+	case PIDTYPE_TGID:
 		owner.type = F_OWNER_PID;
 		break;
 
@@ -785,20 +785,25 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_MAX) {
+	if (type == PIDTYPE_PID)
 		group = 0;
-		type = PIDTYPE_PID;
-	}
 
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
-	
-	read_lock(&tasklist_lock);
-	do_each_pid_task(pid, type, p) {
+
+	if (type <= PIDTYPE_TGID) {
+		rcu_read_lock();
+		p = pid_task(pid, PIDTYPE_PID);
 		send_sigio_to_task(p, fown, fd, band, group);
-	} while_each_pid_task(pid, type, p);
-	read_unlock(&tasklist_lock);
+		rcu_read_unlock();
+	} else {
+		read_lock(&tasklist_lock);
+		do_each_pid_task(pid, type, p) {
+			send_sigio_to_task(p, fown, fd, band, group);
+		} while_each_pid_task(pid, type, p);
+		read_unlock(&tasklist_lock);
+	}
  out_unlock_fown:
 	read_unlock(&fown->lock);
 }
@@ -821,22 +826,27 @@ int send_sigurg(struct fown_struct *fown)
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_MAX) {
+	if (type == PIDTYPE_PID)
 		group = 0;
-		type = PIDTYPE_PID;
-	}
 
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
 
 	ret = 1;
-	
-	read_lock(&tasklist_lock);
-	do_each_pid_task(pid, type, p) {
+
+	if (type <= PIDTYPE_TGID) {
+		rcu_read_lock();
+		p = pid_task(pid, PIDTYPE_PID);
 		send_sigurg_to_task(p, fown, group);
-	} while_each_pid_task(pid, type, p);
-	read_unlock(&tasklist_lock);
+		rcu_read_unlock();
+	} else {
+		read_lock(&tasklist_lock);
+		do_each_pid_task(pid, type, p) {
+			send_sigurg_to_task(p, fown, group);
+		} while_each_pid_task(pid, type, p);
+		read_unlock(&tasklist_lock);
+	}
  out_unlock_fown:
 	read_unlock(&fown->lock);
 	return ret;
diff --git a/fs/locks.c b/fs/locks.c
index db7b6917d9c5..cfc059bda8ea 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -546,7 +546,7 @@ lease_setup(struct file_lock *fl, void **priv)
 	if (!fasync_insert_entry(fa->fa_fd, filp, &fl->fl_fasync, fa))
 		*priv = NULL;
 
-	__f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+	__f_setown(filp, task_pid(current), PIDTYPE_TGID, 0);
 }
 
 static const struct lock_manager_operations lease_manager_ops = {
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e2bea2ac5dfb..484f2c3a33bb 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -19,6 +19,7 @@
 #include <linux/fs.h>
 #include <linux/module.h>
 #include <linux/sched.h>
+#include <linux/sched/signal.h>
 #include <linux/dnotify.h>
 #include <linux/init.h>
 #include <linux/spinlock.h>
@@ -353,7 +354,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 		goto out;
 	}
 
-	__f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
+	__f_setown(filp, task_pid(current), PIDTYPE_TGID, 0);
 
 	error = attach_dn(dn, dn_mark, id, fd, filp, mask);
 	/* !error means that we attached the dn to the dn_mark, so don't free it */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 08/20] posix-timers: Noralize good_sigevent
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (6 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 07/20] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 09/20] signal: Pass pid and pid type into send_sigqueue Eric W. Biederman
                                             ` (14 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

In good_sigevent directly compute the default return value as
"task_tgid(current)".  This is exactly the same as
"task_pid(current->group_leader)" but written more clearly.

In the thread case first compute the thread's pid.  Then verify that
the task attached to that pid is a thread of the current thread group.

This has the net effect of making the code a little clearer, and
making it obvious that posix timers never look up a process by the
pid of a thread.
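
As a rough userspace illustration of the lookup order described above
(this is not kernel code: the toy_ names, the task table, and the use of
plain integers in place of struct pid are all invented for the sketch),
the default target is the caller's thread group id, and a thread-directed
request is only accepted when the named thread belongs to the caller's
thread group:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: its own thread id and its thread group id. */
struct toy_task {
	int pid;
	int tgid;
};

/* Hypothetical lookup: find a task by thread id in a fixed table. */
const struct toy_task *toy_find_task(const struct toy_task *tasks, size_t n, int pid)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].pid == pid)
			return &tasks[i];
	return NULL;
}

/*
 * Returns the id the timer signal would target, or -1 if the request
 * is rejected.  thread_directed models the SIGEV_THREAD_ID case.
 */
int toy_good_sigevent(const struct toy_task *tasks, size_t n,
		      const struct toy_task *current,
		      int thread_directed, int thread_id)
{
	int pid;

	assert(current != NULL);
	pid = current->tgid;		/* default: the whole process */

	if (thread_directed) {
		const struct toy_task *t = toy_find_task(tasks, n, thread_id);

		/* The thread must exist and share the caller's thread group. */
		if (!t || t->tgid != current->tgid)
			return -1;
		pid = t->pid;		/* target exactly that thread */
	}
	return pid;
}
```

In this model a process id is only ever produced from the caller's own
tgid, never by looking up a thread id, which is the property the commit
message calls out.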

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/time/posix-timers.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index e08ce3f27447..2bdf08a2bae9 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -433,11 +433,13 @@ static enum hrtimer_restart posix_timer_fn(struct hrtimer *timer)
 
 static struct pid *good_sigevent(sigevent_t * event)
 {
-	struct task_struct *rtn = current->group_leader;
+	struct pid *pid = task_tgid(current);
+	struct task_struct *rtn;
 
 	switch (event->sigev_notify) {
 	case SIGEV_SIGNAL | SIGEV_THREAD_ID:
-		rtn = find_task_by_vpid(event->sigev_notify_thread_id);
+		pid = find_vpid(event->sigev_notify_thread_id);
+		rtn = pid_task(pid, PIDTYPE_PID);
 		if (!rtn || !same_thread_group(rtn, current))
 			return NULL;
 		/* FALLTHRU */
@@ -447,7 +449,7 @@ static struct pid *good_sigevent(sigevent_t * event)
 			return NULL;
 		/* FALLTHRU */
 	case SIGEV_NONE:
-		return task_pid(rtn);
+		return pid;
 	default:
 		return NULL;
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 09/20] signal: Pass pid and pid type into send_sigqueue
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (7 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 08/20] posix-timers: Normalize good_sigevent Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 10/20] signal: Pass pid type into group_send_sig_info Eric W. Biederman
                                             ` (13 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Make the code more maintainable by performing more of the signal
related work in send_sigqueue.

A quick inspection of do_timer_create will show that this code path
does not look up a thread group by a thread's pid, making it safe
to find the task pointed to by it_pid with "pid_task(it_pid, type)".

This supports the changes needed in fork to tell if a signal was sent
to a single process or a group of processes.

Having the pid to task transition in signal.c will also make it easier
to sort out races with de_thread and the thread group leader
exiting when it comes time to address that.
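
A minimal userspace sketch of the mapping this change relies on may help
when reading the diff below.  It is not kernel code: the toy_ structures
and TOY_SIGEV_THREAD_ID are invented for the example, and the enum
ordering is assumed to match the kernel's.  A timer without the
thread-id flag targets PIDTYPE_TGID, and only PIDTYPE_PID selects the
per-thread pending queue:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror the ordering of the kernel's enum pid_type. */
enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

/* Hypothetical flag standing in for SIGEV_THREAD_ID. */
#define TOY_SIGEV_THREAD_ID 4

/* Toy versions of the pending-signal containers. */
struct toy_sigpending { unsigned long signal; };
struct toy_signal { struct toy_sigpending shared_pending; };
struct toy_task {
	struct toy_sigpending pending;	/* signals for this thread only */
	struct toy_signal *signal;	/* state shared by the thread group */
};

/* Timer side: derive the pid type from the notification flags. */
enum pid_type toy_timer_pid_type(int sigev_notify)
{
	return !(sigev_notify & TOY_SIGEV_THREAD_ID) ? PIDTYPE_TGID : PIDTYPE_PID;
}

/* Signal side: anything other than PIDTYPE_PID goes to the shared queue. */
struct toy_sigpending *toy_pick_pending(struct toy_task *t, enum pid_type type)
{
	assert(t != NULL && t->signal != NULL);
	return (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
}
```

The point of passing the type rather than a boolean is that the callee
can still distinguish PIDTYPE_TGID from PIDTYPE_PGID or PIDTYPE_SID
later, even though all of them pick the shared queue here.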

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  2 +-
 kernel/signal.c              | 14 +++++++++-----
 kernel/time/posix-timers.c   | 13 ++++---------
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ee30a5ba475f..94558ffa82ab 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -330,7 +330,7 @@ extern int send_sig(int, struct task_struct *, int);
 extern int zap_other_threads(struct task_struct *p);
 extern struct sigqueue *sigqueue_alloc(void);
 extern void sigqueue_free(struct sigqueue *);
-extern int send_sigqueue(struct sigqueue *,  struct task_struct *, int group);
+extern int send_sigqueue(struct sigqueue *, struct pid *, enum pid_type);
 extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *);
 
 static inline int restart_syscall(void)
diff --git a/kernel/signal.c b/kernel/signal.c
index 8d8a940422a8..40feb14e276d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1664,17 +1664,20 @@ void sigqueue_free(struct sigqueue *q)
 		__sigqueue_free(q);
 }
 
-int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
+int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 {
 	int sig = q->info.si_signo;
 	struct sigpending *pending;
+	struct task_struct *t;
 	unsigned long flags;
 	int ret, result;
 
 	BUG_ON(!(q->flags & SIGQUEUE_PREALLOC));
 
 	ret = -1;
-	if (!likely(lock_task_sighand(t, &flags)))
+	rcu_read_lock();
+	t = pid_task(pid, type);
+	if (!t || !likely(lock_task_sighand(t, &flags)))
 		goto ret;
 
 	ret = 1; /* the signal is ignored */
@@ -1696,15 +1699,16 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
 	q->info.si_overrun = 0;
 
 	signalfd_notify(t, sig);
-	pending = group ? &t->signal->shared_pending : &t->pending;
+	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
 	list_add_tail(&q->list, &pending->list);
 	sigaddset(&pending->signal, sig);
-	complete_signal(sig, t, group);
+	complete_signal(sig, t, type != PIDTYPE_PID);
 	result = TRACE_SIGNAL_DELIVERED;
 out:
-	trace_signal_generate(sig, &q->info, t, group, result);
+	trace_signal_generate(sig, &q->info, t, type != PIDTYPE_PID, result);
 	unlock_task_sighand(t, &flags);
 ret:
+	rcu_read_unlock();
 	return ret;
 }
 
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 2bdf08a2bae9..2d2e739fbc57 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -332,8 +332,8 @@ void posixtimer_rearm(struct siginfo *info)
 
 int posix_timer_event(struct k_itimer *timr, int si_private)
 {
-	struct task_struct *task;
-	int shared, ret = -1;
+	enum pid_type type;
+	int ret = -1;
 	/*
 	 * FIXME: if ->sigq is queued we can race with
 	 * dequeue_signal()->posixtimer_rearm().
@@ -347,13 +347,8 @@ int posix_timer_event(struct k_itimer *timr, int si_private)
 	 */
 	timr->sigq->info.si_sys_private = si_private;
 
-	rcu_read_lock();
-	task = pid_task(timr->it_pid, PIDTYPE_PID);
-	if (task) {
-		shared = !(timr->it_sigev_notify & SIGEV_THREAD_ID);
-		ret = send_sigqueue(timr->sigq, task, shared);
-	}
-	rcu_read_unlock();
+	type = !(timr->it_sigev_notify & SIGEV_THREAD_ID) ? PIDTYPE_TGID : PIDTYPE_PID;
+	ret = send_sigqueue(timr->sigq, timr->it_pid, type);
 	/* If we failed to send the signal the timer stops. */
 	return ret > 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 10/20] signal: Pass pid type into group_send_sig_info
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (8 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 09/20] signal: Pass pid and pid type into send_sigqueue Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 11/20] signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task Eric W. Biederman
                                             ` (12 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This passes the information we already have at the call site
into group_send_sig_info, ultimately allowing better handling of
signals sent to a group of processes.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/signal.h |  4 +++-
 kernel/exit.c          |  3 ++-
 kernel/signal.c        | 10 ++++++----
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 3c5200137b24..d8f2bf3d41e6 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -254,11 +254,13 @@ static inline int valid_signal(unsigned long sig)
 
 struct timespec;
 struct pt_regs;
+enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
 extern int do_send_sig_info(int sig, struct siginfo *info,
 				struct task_struct *p, bool group);
-extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
+extern int group_send_sig_info(int sig, struct siginfo *info,
+			       struct task_struct *p, enum pid_type type);
 extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
 extern void set_current_blocked(sigset_t *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 25582b442955..0e21e6d21f35 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -681,7 +681,8 @@ static void forget_original_parent(struct task_struct *father,
 				t->parent = t->real_parent;
 			if (t->pdeath_signal)
 				group_send_sig_info(t->pdeath_signal,
-						    SEND_SIG_NOINFO, t);
+						    SEND_SIG_NOINFO, t,
+						    PIDTYPE_TGID);
 		}
 		/*
 		 * If this is a threaded reparent there is no need to
diff --git a/kernel/signal.c b/kernel/signal.c
index 40feb14e276d..c7527338fe9d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1274,7 +1274,8 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 /*
  * send signal info to all the members of a group
  */
-int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
+int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
+			enum pid_type type)
 {
 	int ret;
 
@@ -1301,7 +1302,7 @@ int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp)
 	success = 0;
 	retval = -ESRCH;
 	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
-		int err = group_send_sig_info(sig, info, p);
+		int err = group_send_sig_info(sig, info, p, PIDTYPE_PGID);
 		success |= !err;
 		retval = err;
 	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);
@@ -1317,7 +1318,7 @@ int kill_pid_info(int sig, struct siginfo *info, struct pid *pid)
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_PID);
 		if (p)
-			error = group_send_sig_info(sig, info, p);
+			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
 		rcu_read_unlock();
 		if (likely(!p || error != -ESRCH))
 			return error;
@@ -1420,7 +1421,8 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
 		for_each_process(p) {
 			if (task_pid_vnr(p) > 1 &&
 					!same_thread_group(p, current)) {
-				int err = group_send_sig_info(sig, info, p);
+				int err = group_send_sig_info(sig, info, p,
+							      PIDTYPE_MAX);
 				++count;
 				if (err != -EPERM)
 					retval = err;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 11/20] signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (9 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 10/20] signal: Pass pid type into group_send_sig_info Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 12/20] signal: Pass pid type into do_send_sig_info Eric W. Biederman
                                             ` (11 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This information is already present and using it directly simplifies the logic
of the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fcntl.c | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 1523588fd759..5d596a00f40b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -723,7 +723,7 @@ static inline int sigio_perm(struct task_struct *p,
 
 static void send_sigio_to_task(struct task_struct *p,
 			       struct fown_struct *fown,
-			       int fd, int reason, int group)
+			       int fd, int reason, enum pid_type type)
 {
 	/*
 	 * F_SETSIG can change ->signum lockless in parallel, make
@@ -767,11 +767,11 @@ static void send_sigio_to_task(struct task_struct *p,
 			else
 				si.si_band = mangle_poll(band_table[reason - POLL_IN]);
 			si.si_fd    = fd;
-			if (!do_send_sig_info(signum, &si, p, group))
+			if (!do_send_sig_info(signum, &si, p, type != PIDTYPE_PID))
 				break;
 		/* fall-through: fall back on the old plain SIGIO signal */
 		case 0:
-			do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, group);
+			do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, type != PIDTYPE_PID);
 	}
 }
 
@@ -780,14 +780,10 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 	struct task_struct *p;
 	enum pid_type type;
 	struct pid *pid;
-	int group = 1;
 	
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_PID)
-		group = 0;
-
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
@@ -795,12 +791,12 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 	if (type <= PIDTYPE_TGID) {
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_PID);
-		send_sigio_to_task(p, fown, fd, band, group);
+		send_sigio_to_task(p, fown, fd, band, type);
 		rcu_read_unlock();
 	} else {
 		read_lock(&tasklist_lock);
 		do_each_pid_task(pid, type, p) {
-			send_sigio_to_task(p, fown, fd, band, group);
+			send_sigio_to_task(p, fown, fd, band, type);
 		} while_each_pid_task(pid, type, p);
 		read_unlock(&tasklist_lock);
 	}
@@ -809,10 +805,10 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 }
 
 static void send_sigurg_to_task(struct task_struct *p,
-				struct fown_struct *fown, int group)
+				struct fown_struct *fown, enum pid_type type)
 {
 	if (sigio_perm(p, fown, SIGURG))
-		do_send_sig_info(SIGURG, SEND_SIG_PRIV, p, group);
+		do_send_sig_info(SIGURG, SEND_SIG_PRIV, p, type != PIDTYPE_PID);
 }
 
 int send_sigurg(struct fown_struct *fown)
@@ -820,15 +816,11 @@ int send_sigurg(struct fown_struct *fown)
 	struct task_struct *p;
 	enum pid_type type;
 	struct pid *pid;
-	int group = 1;
 	int ret = 0;
 	
 	read_lock(&fown->lock);
 
 	type = fown->pid_type;
-	if (type == PIDTYPE_PID)
-		group = 0;
-
 	pid = fown->pid;
 	if (!pid)
 		goto out_unlock_fown;
@@ -838,12 +830,12 @@ int send_sigurg(struct fown_struct *fown)
 	if (type <= PIDTYPE_TGID) {
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_PID);
-		send_sigurg_to_task(p, fown, group);
+		send_sigurg_to_task(p, fown, type);
 		rcu_read_unlock();
 	} else {
 		read_lock(&tasklist_lock);
 		do_each_pid_task(pid, type, p) {
-			send_sigurg_to_task(p, fown, group);
+			send_sigurg_to_task(p, fown, type);
 		} while_each_pid_task(pid, type, p);
 		read_unlock(&tasklist_lock);
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 12/20] signal: Pass pid type into do_send_sig_info
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (10 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 11/20] signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 13/20] signal: Push pid type down into send_signal Eric W. Biederman
                                             ` (10 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This passes the information we already have at the call site into
do_send_sig_info, ultimately allowing for better handling of signals
sent to a group of processes during fork.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/tty/sysrq.c    |  2 +-
 fs/fcntl.c             |  6 +++---
 include/linux/signal.h |  2 +-
 kernel/signal.c        | 10 +++++-----
 mm/oom_kill.c          |  4 ++--
 5 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 6364890575ec..06ed20dd01ba 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -348,7 +348,7 @@ static void send_sig_all(int sig)
 		if (is_global_init(p))
 			continue;
 
-		do_send_sig_info(sig, SEND_SIG_FORCED, p, true);
+		do_send_sig_info(sig, SEND_SIG_FORCED, p, PIDTYPE_MAX);
 	}
 	read_unlock(&tasklist_lock);
 }
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5d596a00f40b..a04accf6847f 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -767,11 +767,11 @@ static void send_sigio_to_task(struct task_struct *p,
 			else
 				si.si_band = mangle_poll(band_table[reason - POLL_IN]);
 			si.si_fd    = fd;
-			if (!do_send_sig_info(signum, &si, p, type != PIDTYPE_PID))
+			if (!do_send_sig_info(signum, &si, p, type))
 				break;
 		/* fall-through: fall back on the old plain SIGIO signal */
 		case 0:
-			do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, type != PIDTYPE_PID);
+			do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, type);
 	}
 }
 
@@ -808,7 +808,7 @@ static void send_sigurg_to_task(struct task_struct *p,
 				struct fown_struct *fown, enum pid_type type)
 {
 	if (sigio_perm(p, fown, SIGURG))
-		do_send_sig_info(SIGURG, SEND_SIG_PRIV, p, type != PIDTYPE_PID);
+		do_send_sig_info(SIGURG, SEND_SIG_PRIV, p, type);
 }
 
 int send_sigurg(struct fown_struct *fown)
diff --git a/include/linux/signal.h b/include/linux/signal.h
index d8f2bf3d41e6..fe125b0335f7 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -258,7 +258,7 @@ enum pid_type;
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
 extern int do_send_sig_info(int sig, struct siginfo *info,
-				struct task_struct *p, bool group);
+				struct task_struct *p, enum pid_type type);
 extern int group_send_sig_info(int sig, struct siginfo *info,
 			       struct task_struct *p, enum pid_type type);
 extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/signal.c b/kernel/signal.c
index c7527338fe9d..2c09e6143dd8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1161,13 +1161,13 @@ specific_send_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 }
 
 int do_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
-			bool group)
+			enum pid_type type)
 {
 	unsigned long flags;
 	int ret = -ESRCH;
 
 	if (lock_task_sighand(p, &flags)) {
-		ret = send_signal(sig, info, p, group);
+		ret = send_signal(sig, info, p, type != PIDTYPE_PID);
 		unlock_task_sighand(p, &flags);
 	}
 
@@ -1284,7 +1284,7 @@ int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
 	rcu_read_unlock();
 
 	if (!ret && sig)
-		ret = do_send_sig_info(sig, info, p, true);
+		ret = do_send_sig_info(sig, info, p, type);
 
 	return ret;
 }
@@ -1448,7 +1448,7 @@ int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	if (!valid_signal(sig))
 		return -EINVAL;
 
-	return do_send_sig_info(sig, info, p, false);
+	return do_send_sig_info(sig, info, p, PIDTYPE_PID);
 }
 
 #define __si_special(priv) \
@@ -3199,7 +3199,7 @@ do_send_specific(pid_t tgid, pid_t pid, int sig, struct siginfo *info)
 		 * probe.  No signal is actually delivered.
 		 */
 		if (!error && sig) {
-			error = do_send_sig_info(sig, info, p, false);
+			error = do_send_sig_info(sig, info, p, PIDTYPE_PID);
 			/*
 			 * If lock_task_sighand() failed we pretend the task
 			 * dies after receiving the signal. The window is tiny,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..2cc9b238368f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -920,7 +920,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 * in order to prevent the OOM victim from depleting the memory
 	 * reserves from the user space under its control.
 	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -958,7 +958,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		 */
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
 	}
 	rcu_read_unlock();
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 13/20] signal: Push pid type down into send_signal
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (11 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 12/20] signal: Pass pid type into do_send_sig_info Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 14/20] signal: Push pid type down into __send_signal Eric W. Biederman
                                             ` (9 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This information is already available in the callers and by pushing it
down it makes the code a little clearer, and allows better group
signal behavior in fork.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 2c09e6143dd8..8decc70c1dc2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1103,7 +1103,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 }
 
 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
-			int group)
+			enum pid_type type)
 {
 	int from_ancestor_ns = 0;
 
@@ -1112,7 +1112,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			   !task_pid_nr_ns(current, task_active_pid_ns(t));
 #endif
 
-	return __send_signal(sig, info, t, group, from_ancestor_ns);
+	return __send_signal(sig, info, t, type != PIDTYPE_PID, from_ancestor_ns);
 }
 
 static void print_fatal_signal(int signr)
@@ -1151,13 +1151,13 @@ __setup("print-fatal-signals=", setup_print_fatal_signals);
 int
 __group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 {
-	return send_signal(sig, info, p, 1);
+	return send_signal(sig, info, p, PIDTYPE_TGID);
 }
 
 static int
 specific_send_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 {
-	return send_signal(sig, info, t, 0);
+	return send_signal(sig, info, t, PIDTYPE_PID);
 }
 
 int do_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
@@ -1167,7 +1167,7 @@ int do_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
 	int ret = -ESRCH;
 
 	if (lock_task_sighand(p, &flags)) {
-		ret = send_signal(sig, info, p, type != PIDTYPE_PID);
+		ret = send_signal(sig, info, p, type);
 		unlock_task_sighand(p, &flags);
 	}
 
@@ -3966,7 +3966,7 @@ void kdb_send_sig(struct task_struct *t, int sig)
 			   "the deadlock.\n");
 		return;
 	}
-	ret = send_signal(sig, SEND_SIG_PRIV, t, false);
+	ret = send_signal(sig, SEND_SIG_PRIV, t, PIDTYPE_PID);
 	spin_unlock(&t->sighand->siglock);
 	if (ret)
 		kdb_printf("Fail to deliver Signal %d to process %d.\n",
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 14/20] signal: Push pid type down into __send_signal
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (12 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 13/20] signal: Push pid type down into send_signal Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 15/20] signal: Push pid type down into complete_signal Eric W. Biederman
                                             ` (8 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This information is already available in the callers and by pushing it
down it makes the code a little clearer, and allows implementing
better handling of signals sent to a group of processes in fork.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8decc70c1dc2..1ef94303d87a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -998,7 +998,7 @@ static inline void userns_fixup_signal_uid(struct siginfo *info, struct task_str
 #endif
 
 static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
-			int group, int from_ancestor_ns)
+			enum pid_type type, int from_ancestor_ns)
 {
 	struct sigpending *pending;
 	struct sigqueue *q;
@@ -1012,7 +1012,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			from_ancestor_ns || (info == SEND_SIG_FORCED)))
 		goto ret;
 
-	pending = group ? &t->signal->shared_pending : &t->pending;
+	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
 	/*
 	 * Short-circuit ignored signals and support queuing
 	 * exactly one non-rt signal, so that we can get more
@@ -1096,9 +1096,9 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
-	complete_signal(sig, t, group);
+	complete_signal(sig, t, type != PIDTYPE_PID);
 ret:
-	trace_signal_generate(sig, info, t, group, result);
+	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
 	return ret;
 }
 
@@ -1112,7 +1112,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			   !task_pid_nr_ns(current, task_active_pid_ns(t));
 #endif
 
-	return __send_signal(sig, info, t, type != PIDTYPE_PID, from_ancestor_ns);
+	return __send_signal(sig, info, t, type, from_ancestor_ns);
 }
 
 static void print_fatal_signal(int signr)
@@ -1377,7 +1377,7 @@ int kill_pid_info_as_cred(int sig, struct siginfo *info, struct pid *pid,
 
 	if (sig) {
 		if (lock_task_sighand(p, &flags)) {
-			ret = __send_signal(sig, info, p, 1, 0);
+			ret = __send_signal(sig, info, p, PIDTYPE_TGID, 0);
 			unlock_task_sighand(p, &flags);
 		} else
 			ret = -ESRCH;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 15/20] signal: Push pid type down into complete_signal.
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (13 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 14/20] signal: Push pid type down into __send_signal Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 16/20] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
                                             ` (7 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

This is the bottom of the call chain, and pushing the pid type down this
far simplifies the callers and otherwise leaves things as is.  This is in
preparation for allowing fork to implement better handling of signals sent
to groups of processes.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 1ef94303d87a..dddbea558455 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -895,7 +895,7 @@ static inline int wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !signal_pending(p);
 }
 
-static void complete_signal(int sig, struct task_struct *p, int group)
+static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
 	struct task_struct *t;
@@ -908,7 +908,7 @@ static void complete_signal(int sig, struct task_struct *p, int group)
 	 */
 	if (wants_signal(sig, p))
 		t = p;
-	else if (!group || thread_group_empty(p))
+	else if ((type == PIDTYPE_PID) || thread_group_empty(p))
 		/*
 		 * There is just one thread and it does not need to be woken.
 		 * It will dequeue unblocked signals before it runs again.
@@ -1096,7 +1096,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
-	complete_signal(sig, t, type != PIDTYPE_PID);
+	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
 	return ret;
@@ -1704,7 +1704,7 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
 	list_add_tail(&q->list, &pending->list);
 	sigaddset(&pending->signal, sig);
-	complete_signal(sig, t, type != PIDTYPE_PID);
+	complete_signal(sig, t, type);
 	result = TRACE_SIGNAL_DELIVERED;
 out:
 	trace_signal_generate(sig, &q->info, t, type != PIDTYPE_PID, result);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 16/20] fork: Move and describe why the code examines PIDNS_ADDING
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (14 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 15/20] signal: Push pid type down into complete_signal Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 17/20] fork: Unconditionally exit if a fatal signal is pending Eric W. Biederman
                                             ` (6 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Normally this would be something that would be handled by handling
signals that are sent to a group of processes but in this case the
forking process is not a member of the group being signaled.  Thus
special code is needed to prevent a race with pid namespaces exiting,
and fork adding new processes within them.

Move this test up before the signal restart just in case signals are
also pending.  Fatal conditions should take precedence over restarts.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index cc5be0d01ce6..b9c54318a292 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1922,6 +1922,12 @@ static __latent_entropy struct task_struct *copy_process(
 
 	rseq_fork(p, clone_flags);
 
+	/* Don't start children in a dying pid namespace */
+	if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
+		retval = -ENOMEM;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
@@ -1935,10 +1941,7 @@ static __latent_entropy struct task_struct *copy_process(
 		retval = -ERESTARTNOINTR;
 		goto bad_fork_cancel_cgroup;
 	}
-	if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
-		retval = -ENOMEM;
-		goto bad_fork_cancel_cgroup;
-	}
+
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 17/20] fork: Unconditionally exit if a fatal signal is pending
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (15 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 16/20] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 18/20] signal: Add calculate_sigpending() Eric W. Biederman
                                             ` (5 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

In practice this does not change anything as testing for fatal_signal_pending
and exiting with an error code duplicates the work of the next clause
which recalculates pending signals and then exits fork if any are pending.
In both cases the pending signal will trigger the slow path when exiting
to userspace, and the fatal signal will cause do_exit to be called.

The advantage of making this a separate test is that it makes it clear
processing the fatal signal will terminate the fork, and it allows the
rest of the signal logic to be updated without fear that this important
case will be lost.
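
The ordering of the three checks at this point in the series can be
sketched as a small userspace model.  This is only an illustration: the
struct, the m_ names and the boolean fields are hypothetical stand-ins
for the kernel state, and M_ERESTARTNOINTR is defined locally because
the kernel-internal restart code is not part of the userspace errno.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal restart code, defined locally for this model. */
#define M_ERESTARTNOINTR 513

/* Hypothetical snapshot of the state copy_process() examines. */
struct m_fork_state {
	bool pidns_adding;   /* pid namespace still accepts new processes */
	bool fatal_pending;  /* stand-in for fatal_signal_pending(current) */
	bool signal_pending; /* stand-in for signal_pending(current) */
};

/* Fatal conditions are tested before the restartable one. */
static int m_fork_checks(const struct m_fork_state *s)
{
	if (!s->pidns_adding)
		return -ENOMEM;            /* dying pid namespace */
	if (s->fatal_pending)
		return -EINTR;             /* kill terminates the fork */
	if (s->signal_pending)
		return -M_ERESTARTNOINTR;  /* restart the syscall */
	return 0;
}
```

With both a fatal signal and an ordinary signal pending, the model
returns -EINTR rather than the restart code, which is the case the
separate test is meant to make explicit.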

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9c54318a292..22d4cdb9a7ca 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1928,6 +1928,12 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	/* Let kill terminate clone/fork in the middle */
+	if (fatal_signal_pending(current)) {
+		retval = -EINTR;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 18/20] signal: Add calculate_sigpending()
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (16 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 17/20] fork: Unconditionally exit if a fatal signal is pending Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-26 13:20                             ` Oleg Nesterov
  2018-07-24  3:24                           ` [PATCH 19/20] fork: Have new threads join on-going signal group stops Eric W. Biederman
                                             ` (4 subsequent siblings)
  22 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork.  Signals have to
happen either before or after fork.  Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.

So we need to move the signals that we can to after the fork to prevent that.

This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.

As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.

Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.

The new calculate_sigpending function sets TIF_SIGPENDING if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task or the live kernel patching framework
needs the new thread to take the slow path to userspace.
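
As a rough userspace model of that test (not the kernel code itself),
the decision reduces to "any unblocked pending bit, or any of the other
slow-path users".  The model_ names and the one-bit-per-signal mask
below are assumptions made for the sketch; the kernel uses sigset_t and
the PENDING() macro instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model only: bit (sig - 1) set means signal sig is in the set. */
typedef uint64_t model_sigset;

struct model_task {
	model_sigset pending;         /* per-thread pending signals */
	model_sigset shared_pending;  /* process-wide pending signals */
	model_sigset blocked;         /* signals the task has blocked */
	unsigned long jobctl_pending; /* stand-in for jobctl & JOBCTL_PENDING_MASK */
	bool freezing;                /* stand-in for freezing(task) */
	bool klp_pending;             /* stand-in for klp_patch_pending(task) */
	bool tif_sigpending;          /* stand-in for TIF_SIGPENDING */
};

/* True if at least one signal in 'set' is not blocked. */
static bool model_has_unblocked(model_sigset set, model_sigset blocked)
{
	return (set & ~blocked) != 0;
}

/* Compute the condition once, then set or clear the flag to match. */
static void model_calculate_sigpending(struct model_task *t)
{
	bool pending = t->jobctl_pending != 0 ||
		model_has_unblocked(t->pending, t->blocked) ||
		model_has_unblocked(t->shared_pending, t->blocked) ||
		t->freezing || t->klp_pending;

	t->tif_sigpending = pending;
}
```

Note that a signal which is pending but blocked does not set the flag
in this model, matching the way blocked signals do not appear pending.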

I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes its first userspace
instruction.

I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be handled.
The code probably doesn't need to set TIF_SIGPENDING for the kernel
live patching as it uses a separate thread flag as well.  But at this
point it seems safer to copy recalc_sigpending and get the kernel live
patching folks to sort out their story later.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  1 +
 kernel/fork.c                |  1 +
 kernel/signal.c              | 13 +++++++++++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 94558ffa82ab..7cabc0bc38f6 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -372,6 +372,7 @@ static inline int signal_pending_state(long state, struct task_struct *p)
  */
 extern void recalc_sigpending_and_wake(struct task_struct *t);
 extern void recalc_sigpending(void);
+extern void calculate_sigpending(struct task_struct *new);
 
 extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 22d4cdb9a7ca..e07281254552 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1988,6 +1988,7 @@ static __latent_entropy struct task_struct *copy_process(
 					  &p->signal->thread_head);
 		}
 		attach_pid(p, PIDTYPE_PID);
+		calculate_sigpending(p);
 		nr_threads++;
 	}
 
diff --git a/kernel/signal.c b/kernel/signal.c
index dddbea558455..f6687c7d7a8c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -172,6 +172,19 @@ void recalc_sigpending(void)
 
 }
 
+void calculate_sigpending(struct task_struct *new)
+{
+	/* Have any signals or users of TIF_SIGPENDING been delayed
+	 * until after fork?
+	 */
+	bool pending = (new->jobctl & JOBCTL_PENDING_MASK) ||
+		PENDING(&new->pending, &new->blocked) ||
+		PENDING(&new->signal->shared_pending, &new->blocked) ||
+		freezing(new) || klp_patch_pending(new);
+
+	update_tsk_thread_flag(new, TIF_SIGPENDING, pending);
+}
+
 /* Given the mask, find the first available signal that should be serviced. */
 
 #define SYNCHRONOUS_MASK \
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 19/20] fork: Have new threads join on-going signal group stops
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (17 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 18/20] signal: Add calculate_sigpending() Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24  3:24                           ` [PATCH 20/20] signal: Don't restart fork when signals come in Eric W. Biederman
                                             ` (3 subsequent siblings)
  22 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered.  Which
leaves only SIGSTOP that needs to be considered when creating new
threads.

Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.

A fork (especially of a process with a lot of memory) is one of the
most expensive system calls so we really only want to restart a fork when
necessary.

It is easy to check whether a SIGSTOP is ongoing and have the new
thread join it immediately after the clone completes, making it appear
that the clone completed just before the SIGSTOP.

The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
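
A simplified userspace model of the join may help when reading the
patch below.  The m_ names, the 'dying' flag and the constant values
are assumptions for this sketch; in particular m_set_jobctl_pending
only approximates task_set_jobctl_pending, which refuses to set bits on
a task that is already exiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Model constants; the values are illustrative for this sketch. */
#define M_STOP_SIGMASK 0xffffUL    /* low bits hold the stop signal number */
#define M_STOP_PENDING (1UL << 17) /* the task should stop for a group stop */
#define M_STOP_CONSUME (1UL << 18) /* consume the group stop count when stopping */

struct m_group {
	int group_stop_count; /* threads that still have to stop */
};

struct m_thread {
	unsigned long jobctl;
	bool dying;            /* hypothetical stand-in for an exiting task */
	struct m_group *group;
};

/* Set pending bits unless the thread is going away; report whether it stuck. */
static bool m_set_jobctl_pending(struct m_thread *t, unsigned long mask)
{
	if (t->dying)
		return false;
	if (mask & M_STOP_SIGMASK)
		t->jobctl &= ~M_STOP_SIGMASK; /* replace any old signal number */
	t->jobctl |= mask;
	return true;
}

/* If the parent is in a group stop, enrol the new thread in the same stop. */
static void m_join_group_stop(struct m_thread *parent, struct m_thread *child)
{
	unsigned long jobctl = parent->jobctl;

	if (jobctl & M_STOP_PENDING) {
		unsigned long signr = jobctl & M_STOP_SIGMASK;
		unsigned long gstop = M_STOP_PENDING | M_STOP_CONSUME;

		if (m_set_jobctl_pending(child, signr | gstop))
			child->group->group_stop_count++;
	}
}
```

The count is incremented only when the bits were actually set, so a
thread that cannot participate is never waited for.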

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  2 ++
 kernel/fork.c                | 27 +++++++++++++++------------
 kernel/signal.c              | 14 ++++++++++++++
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7cabc0bc38f6..f3507bf165d0 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -385,6 +385,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 	signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
 }
 
+void task_join_group_stop(struct task_struct *task);
+
 #ifdef TIF_RESTORE_SIGMASK
 /*
  * Legacy restore_sigmask accessors.  These are inefficient on
diff --git a/kernel/fork.c b/kernel/fork.c
index e07281254552..6c358846a8b8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1934,18 +1934,20 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	/*
-	 * Process group and session signals need to be delivered to just the
-	 * parent before the fork or both the parent and the child after the
-	 * fork. Restart if a signal comes in before we add the new process to
-	 * it's process group.
-	 * A fatal signal pending means that current will exit, so the new
-	 * thread can't slip out of an OOM kill (or normal SIGKILL).
-	*/
-	recalc_sigpending();
-	if (signal_pending(current)) {
-		retval = -ERESTARTNOINTR;
-		goto bad_fork_cancel_cgroup;
+	if (!(clone_flags & CLONE_THREAD)) {
+		/*
+		 * Process group and session signals need to be delivered to just the
+		 * parent before the fork or both the parent and the child after the
+		 * fork. Restart if a signal comes in before we add the new process to
+		 * it's process group.
+		 * A fatal signal pending means that current will exit, so the new
+		 * thread can't slip out of an OOM kill (or normal SIGKILL).
+		 */
+		recalc_sigpending();
+		if (signal_pending(current)) {
+			retval = -ERESTARTNOINTR;
+			goto bad_fork_cancel_cgroup;
+		}
 	}
 
 
@@ -1986,6 +1988,7 @@ static __latent_entropy struct task_struct *copy_process(
 					  &p->group_leader->thread_group);
 			list_add_tail_rcu(&p->thread_node,
 					  &p->signal->thread_head);
+			task_join_group_stop(p);
 		}
 		attach_pid(p, PIDTYPE_PID);
 		calculate_sigpending(p);
diff --git a/kernel/signal.c b/kernel/signal.c
index f6687c7d7a8c..78e2d5d196f3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -375,6 +375,20 @@ static bool task_participate_group_stop(struct task_struct *task)
 	return false;
 }
 
+void task_join_group_stop(struct task_struct *task)
+{
+	/* Have the new thread join an on-going signal group stop */
+	unsigned long jobctl = current->jobctl;
+	if (jobctl & JOBCTL_STOP_PENDING) {
+		struct signal_struct *sig = current->signal;
+		unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK;
+		unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
+		if (task_set_jobctl_pending(task, signr | gstop)) {
+			sig->group_stop_count++;
+		}
+	}
+}
+
 /*
  * allocate a new signal queue record
  * - this may be called without locks if and only if t == current, otherwise an
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH 20/20] signal: Don't restart fork when signals come in.
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (18 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 19/20] fork: Have new threads join on-going signal group stops Eric W. Biederman
@ 2018-07-24  3:24                           ` Eric W. Biederman
  2018-07-24 17:27                             ` Linus Torvalds
  2018-07-24 17:29                           ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Linus Torvalds
                                             ` (2 subsequent siblings)
  22 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24  3:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	Eric W. Biederman

Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and its newly spawned child.  For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to be logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple processes, as those signals will
not appear pending during fork.

Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes initialize the new process's
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals, making it
appear as if those signals were received immediately after the fork.
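
The bookkeeping can be modelled in userspace roughly as follows.  This
is a sketch under stated assumptions: a plain singly linked list stands
in for the hlist, a 64-bit mask stands in for sigset_t, no locking is
shown, and all m_ names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model only: bit (sig - 1) set means signal sig is in the set. */
typedef uint64_t m_sigset;

/* One entry per fork in progress, collecting multiprocess signals. */
struct m_delayed {
	m_sigset signal;
	struct m_delayed *next;
};

struct m_signal_struct {
	m_sigset shared_pending;
	struct m_delayed *multiprocess; /* forks currently in progress */
};

/* Start of fork: register an empty collection set with the parent. */
static void m_fork_begin(struct m_signal_struct *parent, struct m_delayed *d)
{
	d->signal = 0;
	d->next = parent->multiprocess;
	parent->multiprocess = d;
}

/* Deliver to the target; multiprocess signals are also recorded per fork. */
static void m_send_signal(struct m_signal_struct *target, int sig, int multiprocess)
{
	struct m_delayed *d;
	m_sigset bit;

	assert(sig >= 1 && sig <= 64);
	bit = (m_sigset)1 << (sig - 1);
	target->shared_pending |= bit;
	if (multiprocess) {
		for (d = target->multiprocess; d; d = d->next)
			d->signal |= bit;
	}
}

/* Remove an entry from the parent's list of forks in progress. */
static void m_unlink(struct m_signal_struct *parent, struct m_delayed *d)
{
	struct m_delayed **pp = &parent->multiprocess;

	while (*pp && *pp != d)
		pp = &(*pp)->next;
	if (*pp)
		*pp = d->next;
}

/* End of fork: the child starts with the collected signals pending. */
static void m_fork_finish(struct m_signal_struct *parent,
			  struct m_signal_struct *child, struct m_delayed *d)
{
	child->shared_pending = d->signal;
	m_unlink(parent, d);
}
```

In this model a signal sent to a single process during the fork lands
only in the parent, while one sent to multiple processes ends up pending
in both parent and child.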

It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that there are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn>
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  8 ++++++++
 kernel/fork.c                | 37 ++++++++++++++++++++----------------
 kernel/signal.c              |  9 +++++++++
 3 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index f3507bf165d0..62262021cf7e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -69,6 +69,11 @@ struct thread_group_cputimer {
 	bool checking_timer;
 };
 
+struct multiprocess_signals {
+	sigset_t signal;
+	struct hlist_node node;
+};
+
 /*
  * NOTE! "signal_struct" does not have its own
  * locking, because a shared signal_struct always
@@ -90,6 +95,9 @@ struct signal_struct {
 	/* shared signal handling: */
 	struct sigpending	shared_pending;
 
+	/* For collecting multiprocess signals during fork */
+	struct hlist_head	multiprocess;
+
 	/* thread group exit support */
 	int			group_exit_code;
 	/* overloaded:
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c358846a8b8..6ee5822f0085 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1456,6 +1456,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	init_waitqueue_head(&sig->wait_chldexit);
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
+	INIT_HLIST_HEAD(&sig->multiprocess);
 	seqlock_init(&sig->stats_lock);
 	prev_cputime_init(&sig->prev_cputime);
 
@@ -1602,6 +1603,24 @@ static __latent_entropy struct task_struct *copy_process(
 {
 	int retval;
 	struct task_struct *p;
+	struct multiprocess_signals delayed;
+
+	/*
+	 * Force any signals received before this point to be delivered
+	 * before the fork happens.  Collect up signals sent to multiple
+	 * processes that happen during the fork and delay them so that
+	 * they appear to happen after the fork.
+	 */
+	sigemptyset(&delayed.signal);
+	INIT_HLIST_NODE(&delayed.node);
+
+	spin_lock_irq(&current->sighand->siglock);
+	if (!(clone_flags & CLONE_THREAD))
+		hlist_add_head(&delayed.node, &current->signal->multiprocess);
+	recalc_sigpending();
+	spin_unlock_irq(&current->sighand->siglock);
+	if (signal_pending(current))
+		return ERR_PTR(restart_syscall());
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1934,22 +1953,6 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	if (!(clone_flags & CLONE_THREAD)) {
-		/*
-		 * Process group and session signals need to be delivered to just the
-		 * parent before the fork or both the parent and the child after the
-		 * fork. Restart if a signal comes in before we add the new process to
-		 * it's process group.
-		 * A fatal signal pending means that current will exit, so the new
-		 * thread can't slip out of an OOM kill (or normal SIGKILL).
-		 */
-		recalc_sigpending();
-		if (signal_pending(current)) {
-			retval = -ERESTARTNOINTR;
-			goto bad_fork_cancel_cgroup;
-		}
-	}
-
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
@@ -1979,6 +1982,8 @@ static __latent_entropy struct task_struct *copy_process(
 			attach_pid(p, PIDTYPE_TGID);
 			attach_pid(p, PIDTYPE_PGID);
 			attach_pid(p, PIDTYPE_SID);
+			p->signal->shared_pending.signal = delayed.signal;
+			hlist_del(&delayed.node);
 			__this_cpu_inc(process_counts);
 		} else {
 			current->signal->nr_threads++;
diff --git a/kernel/signal.c b/kernel/signal.c
index 78e2d5d196f3..5b1aab94daf6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1123,6 +1123,15 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
+
+	/* Let multiprocess signals appear after on-going forks */
+	if (type > PIDTYPE_TGID) {
+		struct multiprocess_signals *delayed;
+		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
+			sigaddset(&delayed->signal, sig);
+		}
+	}
+
 	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 20/20] signal: Don't restart fork when signals come in.
  2018-07-24  3:24                           ` [PATCH 20/20] signal: Don't restart fork when signals come in Eric W. Biederman
@ 2018-07-24 17:27                             ` Linus Torvalds
  2018-07-24 17:58                               ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-24 17:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

This is completely broken.

On Mon, Jul 23, 2018 at 8:27 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6c358846a8b8..6ee5822f0085 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1602,6 +1603,24 @@ static __latent_entropy struct task_struct *copy_process(
>  {
>         int retval;
>         struct task_struct *p;
> +       struct multiprocess_signals delayed;
> +
> +       /*
> +        * Force any signals received before this point to be delivered
> +        * before the fork happens.  Collect up signals sent to multiple
> +        * processes that happen during the fork and delay them so that
> +        * they appear to happen after the fork.
> +        */
> +       sigemptyset(&delayed.signal);
> +       INIT_HLIST_NODE(&delayed.node);
> +
> +       spin_lock_irq(&current->sighand->siglock);
> +       if (!(clone_flags & CLONE_THREAD))
> +               hlist_add_head(&delayed.node, &current->signal->multiprocess);

Here you add the entry to the multiprocess list.

> +       recalc_sigpending();
> +       spin_unlock_irq(&current->sighand->siglock);
> +       if (signal_pending(current))
> +               return ERR_PTR(restart_syscall());

.. and here you return with the list entry still there, pointing to
the stack that you now no longer use.

The same is true of *all* the error cases, because the only point you
remove it is for the success case:

> @@ -1979,6 +1982,8 @@ static __latent_entropy struct task_struct *copy_process(
>                         attach_pid(p, PIDTYPE_TGID);
>                         attach_pid(p, PIDTYPE_PGID);
>                         attach_pid(p, PIDTYPE_SID);
> +                       p->signal->shared_pending.signal = delayed.signal;
> +                       hlist_del(&delayed.node);

So for all the error cases, you leave a dangling pointer to the
current stack in that signal handler, and then return an error.

Guaranteed stack and list corruption.
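
To make the failure mode concrete, here is a minimal userspace sketch
of the pattern that avoids it: a node that lives on the stack has to be
unlinked on every exit path, not only on success.  This is not the fix
that was later posted; the s_ names and the 'fail' parameter are
hypothetical and exist only for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct s_node { struct s_node *next; };
struct s_list { struct s_node *head; };

/* Push a node at the head of the list. */
static void s_add(struct s_list *l, struct s_node *n)
{
	n->next = l->head;
	l->head = n;
}

/* Unlink a node if it is on the list. */
static void s_del(struct s_list *l, struct s_node *n)
{
	struct s_node **pp = &l->head;

	while (*pp && *pp != n)
		pp = &(*pp)->next;
	if (*pp)
		*pp = n->next;
}

/*
 * Registers a stack-allocated node, then either fails or succeeds.
 * Both paths leave through out_unlink, so the list never keeps a
 * pointer into this function's dead stack frame.
 */
static int s_do_fork(struct s_list *l, int fail)
{
	struct s_node on_stack;
	int ret = 0;

	s_add(l, &on_stack);
	if (fail) {
		ret = -1;
		goto out_unlink;
	}
	/* ... the successful work would happen here ... */
out_unlink:
	s_del(l, &on_stack);
	return ret;
}
```

Returning directly from the failure branch instead of jumping to
out_unlink would reproduce the dangling list entry described above.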

                 Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/20] PIDTYPE_TGID removal of fork restarts
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (19 preceding siblings ...)
  2018-07-24  3:24                           ` [PATCH 20/20] signal: Don't restart fork when signals come in Eric W. Biederman
@ 2018-07-24 17:29                           ` Linus Torvalds
  2018-07-25 16:05                           ` Oleg Nesterov
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
  22 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2018-07-24 17:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Mon, Jul 23, 2018 at 8:23 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Please take a look and verify that I have caught everything.  I think I
> have but if not please let me know.

Looks mostly ok, except for the completely broken cleanup of the
"multiprocess" list.

               Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 17:27                             ` Linus Torvalds
@ 2018-07-24 17:58                               ` Eric W. Biederman
  2018-07-24 18:29                                 ` Linus Torvalds
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24 17:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> This is completely broken.
>
> On Mon, Jul 23, 2018 at 8:27 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 6c358846a8b8..6ee5822f0085 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1602,6 +1603,24 @@ static __latent_entropy struct task_struct *copy_process(
>>  {
>>         int retval;
>>         struct task_struct *p;
>> +       struct multiprocess_signals delayed;
>> +
>> +       /*
>> +        * Force any signals received before this point to be delivered
>> +        * before the fork happens.  Collect up signals sent to multiple
>> +        * processes that happen during the fork and delay them so that
>> +        * they appear to happen after the fork.
>> +        */
>> +       sigemptyset(&delayed.signal);
>> +       INIT_HLIST_NODE(&delayed.node);
>> +
>> +       spin_lock_irq(&current->sighand->siglock);
>> +       if (!(clone_flags & CLONE_THREAD))
>> +               hlist_add_head(&delayed.node, &current->signal->multiprocess);
>
> Here you add the entry to the multiprocess list.
>
>> +       recalc_sigpending();
>> +       spin_unlock_irq(&current->sighand->siglock);
>> +       if (signal_pending(current))
>> +               return ERR_PTR(restart_syscall());
>
> .. and here you return with the list entry still there, pointing to
> the stack that you now no longer use.
>
> The same is true of *all* the error cases, because the only point you
> remove it is for the success case:

Yes you are quite right.   Easy enough to fix, but it definitely needs
to be fixed.

I will respin.

Eric


>> @@ -1979,6 +1982,8 @@ static __latent_entropy struct task_struct *copy_process(
>>                         attach_pid(p, PIDTYPE_TGID);
>>                         attach_pid(p, PIDTYPE_PGID);
>>                         attach_pid(p, PIDTYPE_SID);
>> +                       p->signal->shared_pending.signal = delayed.signal;
>> +                       hlist_del(&delayed.node);
>
> So for all the error cases, you leave a dangling pointer to the
> current stack in that signal handler, and then return an error.
>
> Guaranteed stack and list corruption.
>
>                  Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 17:58                               ` Eric W. Biederman
@ 2018-07-24 18:29                                 ` Linus Torvalds
  2018-07-24 20:05                                   ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-24 18:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Tue, Jul 24, 2018 at 10:58 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Yes you are quite right.   Easy enough to fix, but it definitely needs
> to be fixed.
>
> I will respin.

Would you mind trying a slightly different approach for this?

How about moving the "copy_signal()" and "copy_sighandler()" cases up
to fairly early in the "copy_process()" function (and clean up late,
obviously).

Then, instead of that "struct multiprocess_signals" thing, just add a
"struct hlist_node node" to "struct signal" itself, and add it to the
multiprocess hlist there.

And then you can just remove it in bad_fork_cleanup_signal.

Does this make "struct signal" a bit larger? Yes, but it's not a huge
deal. We *could* make it some union with existing fields if we cared.

But I think it would make the code *much* more understandable, and it
would allow us to not have that "sigpending" copy, because you can
just populate the final "signal->shared_pending" directly.

Hmm?

             Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 18:29                                 ` Linus Torvalds
@ 2018-07-24 20:05                                   ` Eric W. Biederman
  2018-07-24 20:15                                     ` Linus Torvalds
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24 20:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jul 24, 2018 at 10:58 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Yes you are quite right.   Easy enough to fix, but it definitely needs
>> to be fixed.
>>
>> I will respin.
>
> Would you mind trying a slightly different approach for this?
>
> How about moving the "copy_signal()" and "copy_sighandler()" cases up
> to fairly early in the "copy_process()" function (and clean up late,
> obviously).
>
> Then, instead of that "struct multiprocess_signals" thing, just add a
> "struct hlist_node node" to "struct signal" itself, and add it to the
> multiprocess hlist there.
>
> And then you can just remove it in bad_fork_cleanup_signal.
>
> Does this make "struct signal" a bit larger? Yes, but it's not a huge
> deal. We *could* make it some union with existing fields if we cared.
>
> But I think it would make the code *much* more understandable, and it
> would allow us to not have that "sigpending" copy, because you can
> just populate the final "signal->shared_pending" directly.
>
> Hmm?

I don't like it.

What I hear you asking is  moving up copy_signal copy_sighand copy_creds
and alloc_pid, and anything else that signal delivery might depend on.

Then in send_signal running __send_signal in a loop first for the
forking process and then for every process that is currently in the
middle of fork.

It feels like this gets us much later in fork, and there is a lot more
code to move and review.   Which to some extent makes us more
susceptible to periodic signals, as more work will be thrown away and
have to be redone.  Plus it makes the whole thing susceptible to signal
delivery growing some additional dependency (perhaps cgroups?) and that
getting missed and never noticed until someone manages to time sending
a signal just right.

I really want something very simple and straight forward because I don't
see us testing or hitting this code path much in practice.  Moving this
into the middle of fork and adding more dependencies does not seem like
it will be that kind of straight forward.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 20:05                                   ` Eric W. Biederman
@ 2018-07-24 20:15                                     ` Linus Torvalds
  2018-07-24 20:40                                       ` [PATCH v2 " Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-24 20:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Tue, Jul 24, 2018 at 1:05 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> What I hear you asking is  moving up copy_signal copy_sighand copy_creds
> and alloc_pid, and anything else that signal delivery might depend on.

No, _just_ signal allocation.

It would still just use the special-case list to set the pending bit.
No creds, no "full task", nothing like that.

> I really want something very simple and straightforward because I don't
> see us testing or hitting this code path much in practice.  Moving this
> into the middle of fork and adding more dependencies does not seem like
> it will be that kind of straightforward.

I think your "list on the stack" was anything but straightforward,
considering how utterly broken the error handling of the patch was.

But hey, send a fixed patch and see how it looks. I think you'll end
up adding a lot of "goto signal_cleanup" cases.

          Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v2 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 20:15                                     ` Linus Torvalds
@ 2018-07-24 20:40                                       ` Eric W. Biederman
  2018-07-24 20:56                                         ` Linus Torvalds
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24 20:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang


Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and its newly spawned child.  For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to be logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple processes, as those signals will
not appear pending during fork.

Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes initialize the new process's
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals, making it
appear as if those signals were received immediately after the fork.

It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that there are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.

V2: Added removal from the multiprocess list on failure.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn>
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  8 +++++++
 kernel/fork.c                | 44 +++++++++++++++++++++++-------------
 kernel/signal.c              |  9 ++++++++
 3 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index f3507bf165d0..62262021cf7e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -69,6 +69,11 @@ struct thread_group_cputimer {
 	bool checking_timer;
 };
 
+struct multiprocess_signals {
+	sigset_t signal;
+	struct hlist_node node;
+};
+
 /*
  * NOTE! "signal_struct" does not have its own
  * locking, because a shared signal_struct always
@@ -90,6 +95,9 @@ struct signal_struct {
 	/* shared signal handling: */
 	struct sigpending	shared_pending;
 
+	/* For collecting multiprocess signals during fork */
+	struct hlist_head	multiprocess;
+
 	/* thread group exit support */
 	int			group_exit_code;
 	/* overloaded:
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c358846a8b8..7cd3d22bca94 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1456,6 +1456,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	init_waitqueue_head(&sig->wait_chldexit);
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
+	INIT_HLIST_HEAD(&sig->multiprocess);
 	seqlock_init(&sig->stats_lock);
 	prev_cputime_init(&sig->prev_cputime);
 
@@ -1602,6 +1603,7 @@ static __latent_entropy struct task_struct *copy_process(
 {
 	int retval;
 	struct task_struct *p;
+	struct multiprocess_signals delayed;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1649,6 +1651,25 @@ static __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * Force any signals received before this point to be delivered
+	 * before the fork happens.  Collect up signals sent to multiple
+	 * processes that happen during the fork and delay them so that
+	 * they appear to happen after the fork.
+	 */
+	sigemptyset(&delayed.signal);
+	INIT_HLIST_NODE(&delayed.node);
+
+	spin_lock_irq(&current->sighand->siglock);
+	if (!(clone_flags & CLONE_THREAD))
+		hlist_add_head(&delayed.node, &current->signal->multiprocess);
+	recalc_sigpending();
+	spin_unlock_irq(&current->sighand->siglock);
+	if (signal_pending(current)) {
+		retval = restart_syscall();
+		goto fork_out;
+	}
+
 	retval = -ENOMEM;
 	p = dup_task_struct(current, node);
 	if (!p)
@@ -1934,22 +1955,6 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	if (!(clone_flags & CLONE_THREAD)) {
-		/*
-		 * Process group and session signals need to be delivered to just the
-		 * parent before the fork or both the parent and the child after the
-		 * fork. Restart if a signal comes in before we add the new process to
-		 * it's process group.
-		 * A fatal signal pending means that current will exit, so the new
-		 * thread can't slip out of an OOM kill (or normal SIGKILL).
-		 */
-		recalc_sigpending();
-		if (signal_pending(current)) {
-			retval = -ERESTARTNOINTR;
-			goto bad_fork_cancel_cgroup;
-		}
-	}
-
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
@@ -1979,6 +1984,8 @@ static __latent_entropy struct task_struct *copy_process(
 			attach_pid(p, PIDTYPE_TGID);
 			attach_pid(p, PIDTYPE_PGID);
 			attach_pid(p, PIDTYPE_SID);
+			p->signal->shared_pending.signal = delayed.signal;
+			hlist_del(&delayed.node);
 			__this_cpu_inc(process_counts);
 		} else {
 			current->signal->nr_threads++;
@@ -2060,6 +2067,11 @@ static __latent_entropy struct task_struct *copy_process(
 	put_task_stack(p);
 	free_task(p);
 fork_out:
+	if (!(clone_flags & CLONE_THREAD)) {
+		spin_lock_irq(&current->sighand->siglock);
+		hlist_del(&delayed.node);
+		spin_unlock_irq(&current->sighand->siglock);
+	}
 	return ERR_PTR(retval);
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 78e2d5d196f3..5b1aab94daf6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1123,6 +1123,15 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
+
+	/* Let multiprocess signals appear after on-going forks */
+	if (type > PIDTYPE_TGID) {
+		struct multiprocess_signals *delayed;
+		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
+			sigaddset(&delayed->signal, sig);
+		}
+	}
+
 	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 20:40                                       ` [PATCH v2 " Eric W. Biederman
@ 2018-07-24 20:56                                         ` Linus Torvalds
  2018-07-24 21:24                                           ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Linus Torvalds @ 2018-07-24 20:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Tue, Jul 24, 2018 at 1:40 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> +       if (signal_pending(current)) {
> +               retval = restart_syscall();
> +               goto fork_out;
> +       }

Oh, the previous version had this too, but it wasn't as obvious
because it was just in a single line:

        return ERR_PTR(restart_syscall());

but it's just crazy.

It should just be

        retval = -ERESTARTNOINTR;
        if (signal_pending(current))
                goto fork_out;

because it's just silly and pointless to change the code to use
restart_syscall() here.

All restart_syscall() does is

        set_tsk_thread_flag(current, TIF_SIGPENDING);
        return -ERESTARTNOINTR;

and you just *checked* that TIF_SIGPENDING was already set. So the
above is completely pointless.

It is not clear why you made that change. The old code had the simpler
"just return -ERESTARTNOINTR" model.

Did the restart_syscall() thing come in by mistake from some previous
trials and it just hung around?

            Linus
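
The equivalence argued above can be checked with a small userspace model.
This is only a sketch under stated assumptions: struct model_task and all
model_* functions are hypothetical stand-ins invented for illustration, not
kernel code.  Only the value 513 for ERESTARTNOINTR and the two-statement
body of restart_syscall() quoted above are taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric value as ERESTARTNOINTR in the kernel's include/linux/errno.h. */
#define MODEL_ERESTARTNOINTR 513

/* Hypothetical stand-in for a task; the bool models TIF_SIGPENDING. */
struct model_task {
	bool sigpending;
};

/* Models restart_syscall(): set the flag, return the restart code. */
static int model_restart_syscall(struct model_task *t)
{
	assert(t);
	t->sigpending = true;
	return -MODEL_ERESTARTNOINTR;
}

/* Shape of the v2 hunk: test the flag, then call the helper. */
static int model_check_v2(struct model_task *t)
{
	assert(t);
	if (t->sigpending)
		return model_restart_syscall(t);
	return 0;
}

/* Suggested shape: preload the error code, then test the flag. */
static int model_check_v3(struct model_task *t)
{
	int retval = -MODEL_ERESTARTNOINTR;

	assert(t);
	if (t->sigpending)
		return retval;
	return 0;
}
```

Starting both variants from a task whose flag is already set gives the same
return value and leaves the flag in the same state, so in this model the
helper call adds nothing once signal_pending() has been checked.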

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 20:56                                         ` Linus Torvalds
@ 2018-07-24 21:24                                           ` Eric W. Biederman
  2018-07-25  3:56                                             ` [PATCH v3 " Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-24 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jul 24, 2018 at 1:40 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> +       if (signal_pending(current)) {
>> +               retval = restart_syscall();
>> +               goto fork_out;
>> +       }
>
> Oh, the previous version had this too, but it wasn't as obvious
> because it was just in a single line:
>
>         return ERR_PTR(restart_syscall());
>
> but it's just crazy.
>
> It should just be
>
>         retval = -ERESTARTNOINTR;
>         if (signal_pending(current))
>                 goto fork_out;
>
> because it's just silly and pointless to change the code to use
> restart_syscall() here.
>
> All restart_syscall() does is
>
>         set_tsk_thread_flag(current, TIF_SIGPENDING);
>         return -ERESTARTNOINTR;
>
> and you just *checked* that TIF_SIGPENDING was already set. So the
> above is completely pointless.
>
> It is not clear why you made that change. The old code had the simpler
> "just return -ERESTARTNOINTR" model.
>
> Did the restart_syscall() thing come in by mistake from some previous
> trials and it just hung around?

I think this is the one place in the kernel where we can restart a
system call and not set TIF_SIGPENDING.

Several years ago I made the mistake in the networking code of returning
-ERESTARTNOINTR and forgetting to set TIF_SIGPENDING.  That wasn't fun.
So I wrote restart_syscall and use it so I don't make that mistake
again.

In this case your suggestion will definitely generate better code, so I
am happy to make a V3 that doesn't use restart_syscall.  A person
working in the guts of fork can reasonably be expected to understand and
to have all of the subtleties in cache as they work on that part of fork.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v3 20/20] signal: Don't restart fork when signals come in.
  2018-07-24 21:24                                           ` Eric W. Biederman
@ 2018-07-25  3:56                                             ` Eric W. Biederman
  2018-07-26 13:41                                               ` Oleg Nesterov
  2018-07-26 16:21                                               ` Oleg Nesterov
  0 siblings, 2 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-25  3:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang


Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and its newly spawned child.  For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to be logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple processes, as those signals will
not appear pending during fork.

Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes initialize the new process's
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals, making it
appear as if those signals were received immediately after the fork.

It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that there are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.

V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn>
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  8 +++++++
 kernel/fork.c                | 43 ++++++++++++++++++++++--------------
 kernel/signal.c              |  9 ++++++++
 3 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index f3507bf165d0..62262021cf7e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -69,6 +69,11 @@ struct thread_group_cputimer {
 	bool checking_timer;
 };
 
+struct multiprocess_signals {
+	sigset_t signal;
+	struct hlist_node node;
+};
+
 /*
  * NOTE! "signal_struct" does not have its own
  * locking, because a shared signal_struct always
@@ -90,6 +95,9 @@ struct signal_struct {
 	/* shared signal handling: */
 	struct sigpending	shared_pending;
 
+	/* For collecting multiprocess signals during fork */
+	struct hlist_head	multiprocess;
+
 	/* thread group exit support */
 	int			group_exit_code;
 	/* overloaded:
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c358846a8b8..94951fb562db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1456,6 +1456,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	init_waitqueue_head(&sig->wait_chldexit);
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
+	INIT_HLIST_HEAD(&sig->multiprocess);
 	seqlock_init(&sig->stats_lock);
 	prev_cputime_init(&sig->prev_cputime);
 
@@ -1602,6 +1603,7 @@ static __latent_entropy struct task_struct *copy_process(
 {
 	int retval;
 	struct task_struct *p;
+	struct multiprocess_signals delayed;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1649,6 +1651,24 @@ static __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * Force any signals received before this point to be delivered
+	 * before the fork happens.  Collect up signals sent to multiple
+	 * processes that happen during the fork and delay them so that
+	 * they appear to happen after the fork.
+	 */
+	sigemptyset(&delayed.signal);
+	INIT_HLIST_NODE(&delayed.node);
+
+	spin_lock_irq(&current->sighand->siglock);
+	if (!(clone_flags & CLONE_THREAD))
+		hlist_add_head(&delayed.node, &current->signal->multiprocess);
+	recalc_sigpending();
+	spin_unlock_irq(&current->sighand->siglock);
+	retval = -ERESTARTNOINTR;
+	if (signal_pending(current))
+		goto fork_out;
+
 	retval = -ENOMEM;
 	p = dup_task_struct(current, node);
 	if (!p)
@@ -1934,22 +1954,6 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	if (!(clone_flags & CLONE_THREAD)) {
-		/*
-		 * Process group and session signals need to be delivered to just the
-		 * parent before the fork or both the parent and the child after the
-		 * fork. Restart if a signal comes in before we add the new process to
-		 * it's process group.
-		 * A fatal signal pending means that current will exit, so the new
-		 * thread can't slip out of an OOM kill (or normal SIGKILL).
-		 */
-		recalc_sigpending();
-		if (signal_pending(current)) {
-			retval = -ERESTARTNOINTR;
-			goto bad_fork_cancel_cgroup;
-		}
-	}
-
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
@@ -1979,6 +1983,8 @@ static __latent_entropy struct task_struct *copy_process(
 			attach_pid(p, PIDTYPE_TGID);
 			attach_pid(p, PIDTYPE_PGID);
 			attach_pid(p, PIDTYPE_SID);
+			p->signal->shared_pending.signal = delayed.signal;
+			hlist_del(&delayed.node);
 			__this_cpu_inc(process_counts);
 		} else {
 			current->signal->nr_threads++;
@@ -2060,6 +2066,11 @@ static __latent_entropy struct task_struct *copy_process(
 	put_task_stack(p);
 	free_task(p);
 fork_out:
+	if (!(clone_flags & CLONE_THREAD)) {
+		spin_lock_irq(&current->sighand->siglock);
+		hlist_del(&delayed.node);
+		spin_unlock_irq(&current->sighand->siglock);
+	}
 	return ERR_PTR(retval);
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 78e2d5d196f3..5b1aab94daf6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1123,6 +1123,15 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
+
+	/* Let multiprocess signals appear after on-going forks */
+	if (type > PIDTYPE_TGID) {
+		struct multiprocess_signals *delayed;
+		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
+			sigaddset(&delayed->signal, sig);
+		}
+	}
+
 	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
-- 
2.17.1
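
The collection scheme in the patch above can be modelled in userspace to
see which signals reach the child.  This is a rough sketch, not the kernel
implementation: every model_* name is hypothetical, a plain singly linked
list stands in for the hlist, locking is omitted entirely, and per-thread
pending queues are left out.  The enum order mirrors the PIDTYPE_PID <
PIDTYPE_TGID < PIDTYPE_PGID < PIDTYPE_SID ordering that the
"type > PIDTYPE_TGID" test relies on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical mirror of the pid_type ordering used by the patch. */
enum model_pid_type { MODEL_PID, MODEL_TGID, MODEL_PGID, MODEL_SID };

/* Stand-in for struct multiprocess_signals (lives on the forker's stack). */
struct model_delayed {
	sigset_t signal;
	struct model_delayed *next;
};

/* Stand-in for the relevant parts of struct signal_struct. */
struct model_signal_struct {
	sigset_t shared_pending;
	struct model_delayed *multiprocess;	/* forks currently in flight */
};

static void model_signal_init(struct model_signal_struct *s)
{
	assert(s);
	sigemptyset(&s->shared_pending);
	s->multiprocess = NULL;
}

/* Start of fork: publish an empty collection set on the parent. */
static void model_fork_begin(struct model_signal_struct *parent,
			     struct model_delayed *d)
{
	sigemptyset(&d->signal);
	d->next = parent->multiprocess;
	parent->multiprocess = d;
}

/* Signal generation: multiprocess signals are also recorded for forks. */
static void model_send_signal(struct model_signal_struct *s, int sig,
			      enum model_pid_type type)
{
	struct model_delayed *d;

	if (type != MODEL_PID)		/* per-thread queue not modelled */
		sigaddset(&s->shared_pending, sig);
	if (type > MODEL_TGID)
		for (d = s->multiprocess; d; d = d->next)
			sigaddset(&d->signal, sig);
}

/* Remove d from the parent's list; used on success and on failure. */
static void model_unlink(struct model_signal_struct *parent,
			 struct model_delayed *d)
{
	struct model_delayed **pp;

	for (pp = &parent->multiprocess; *pp; pp = &(*pp)->next) {
		if (*pp == d) {
			*pp = d->next;
			break;
		}
	}
}

/* Successful fork: the child starts with the collected signals pending. */
static void model_fork_end(struct model_signal_struct *parent,
			   struct model_signal_struct *child,
			   struct model_delayed *d)
{
	model_signal_init(child);
	child->shared_pending = d->signal;
	model_unlink(parent, d);
}
```

In this model a process-directed SIGALRM sent during the fork stays with
the parent only, while a process-group SIGHUP sent during the fork ends up
pending in both parent and child.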


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/20] PIDTYPE_TGID removal of fork restarts
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (20 preceding siblings ...)
  2018-07-24 17:29                           ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Linus Torvalds
@ 2018-07-25 16:05                           ` Oleg Nesterov
  2018-07-25 16:58                             ` Eric W. Biederman
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
  22 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-25 16:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/23, Eric W. Biederman wrote:
>
>       signal: Add calculate_sigpending()
>       fork: Have new threads join on-going signal group stops
>       signal: Don't restart fork when signals come in.

Oh, I need to re-read these patches tomorrow. I have some concerns, but perhaps
I am wrong. Will write another email tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/20] PIDTYPE_TGID removal of fork restarts
  2018-07-25 16:05                           ` Oleg Nesterov
@ 2018-07-25 16:58                             ` Eric W. Biederman
  0 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-25 16:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/23, Eric W. Biederman wrote:
>>
>>       signal: Add calculate_sigpending()
>>       fork: Have new threads join on-going signal group stops
>>       signal: Don't restart fork when signals come in.
>
> Oh, I need to re-read these patches tomorrow. I have some concerns, but perhaps
> I am wrong. Will write another email tomorrow.

Thank you.  I look forward to seeing what you think after you have
re-read them.

Eric


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 18/20] signal: Add calculate_sigpending()
  2018-07-24  3:24                           ` [PATCH 18/20] signal: Add calculate_sigpending() Eric W. Biederman
@ 2018-07-26 13:20                             ` Oleg Nesterov
  2018-07-26 15:13                               ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-26 13:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/23, Eric W. Biederman wrote:
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1988,6 +1988,7 @@ static __latent_entropy struct task_struct *copy_process(
>  					  &p->signal->thread_head);
>  		}
>  		attach_pid(p, PIDTYPE_PID);
> +		calculate_sigpending(p);

In theory this looks racy if !CLONE_SIGHAND, please see below

> +void calculate_sigpending(struct task_struct *new)
> +{
> +	/* Have any signals or users of TIF_SIGPENDING been delayed
> +	 * until after fork?
> +	 */
> +	bool pending = (new->jobctl & JOBCTL_PENDING_MASK) ||
> +		PENDING(&new->pending, &new->blocked) ||
> +		PENDING(&new->signal->shared_pending, &new->blocked) ||
> +		freezing(new) || klp_patch_pending(new);

note that we do not hold new->sighand->siglock, but this "new" task is already
visible to find_task_by_vpid/etc; so a new signal can come right after this check,

> +	update_tsk_thread_flag(new, TIF_SIGPENDING, pending);

and then update_tsk_thread_flag() can wrongly clear TIF_SIGPENDING.

Easy to fix, but perhaps we can simply add recalc_sigpending() into
schedule_tail() ? It already does more than just finish_task_switch/etc.

This way we do not need the new helper (which btw can only be used by
copy_process).


Note also that either way you can remove set_tsk_thread_flag(TIF_SIGPENDING)
from ptrace_init_task().

Oleg.
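
The lost-flag interleaving described above can be replayed
deterministically in a small userspace model.  This is a sketch only: the
model_* names are hypothetical, there is a single pending bit instead of
real signal queues, and the "race" is a fixed ordering of three steps
rather than real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new task. */
struct model_task {
	bool has_pending_signal;	/* something is queued for the task */
	bool tif_sigpending;		/* models TIF_SIGPENDING */
};

/* Models the "bool pending = ..." computation in calculate_sigpending(). */
static bool model_compute_pending(const struct model_task *t)
{
	assert(t);
	return t->has_pending_signal;
}

/* Models a sender queueing a signal and setting the flag. */
static void model_send_signal(struct model_task *t)
{
	assert(t);
	t->has_pending_signal = true;
	t->tif_sigpending = true;
}

/* Models update_tsk_thread_flag(new, TIF_SIGPENDING, pending). */
static void model_update_flag(struct model_task *t, bool pending)
{
	assert(t);
	t->tif_sigpending = pending;
}

/* Unlocked case: the signal lands between the check and the update. */
static bool model_racy_interleaving(void)
{
	struct model_task t = { false, false };
	bool pending = model_compute_pending(&t);	/* sees nothing */

	model_send_signal(&t);				/* flag set here */
	model_update_flag(&t, pending);			/* flag cleared */
	return t.tif_sigpending;
}

/* Serialized case: check and update run as one unit, as under siglock. */
static bool model_locked_interleaving(bool signal_first)
{
	struct model_task t = { false, false };

	if (signal_first)
		model_send_signal(&t);
	model_update_flag(&t, model_compute_pending(&t));
	if (!signal_first)
		model_send_signal(&t);
	return t.tif_sigpending;
}
```

The racy ordering ends with a queued signal but a clear flag, while either
serialized ordering leaves the flag set.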


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v3 20/20] signal: Don't restart fork when signals come in.
  2018-07-25  3:56                                             ` [PATCH v3 " Eric W. Biederman
@ 2018-07-26 13:41                                               ` Oleg Nesterov
  2018-07-26 14:42                                                 ` Eric W. Biederman
  2018-07-26 16:21                                               ` Oleg Nesterov
  1 sibling, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-26 13:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On 07/24, Eric W. Biederman wrote:
>
> @@ -1979,6 +1983,8 @@ static __latent_entropy struct task_struct *copy_process(
>  			attach_pid(p, PIDTYPE_TGID);
>  			attach_pid(p, PIDTYPE_PGID);
>  			attach_pid(p, PIDTYPE_SID);
> +			p->signal->shared_pending.signal = delayed.signal;

Again, in this case we do not hold p->sighand->siglock (unless CLONE_SIGHAND),
so this should be done before list_add_tail/attach_pid above. Plus we need some
sort of barrier.

Or we can do

	if (!CLONE_SIGHAND)
		spin_lock_nested(child->siglock);

at the start of "if (likely(p->pid))" block.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1123,6 +1123,15 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
>  out_set:
>  	signalfd_notify(t, sig);
>  	sigaddset(&pending->signal, sig);
> +
> +	/* Let multiprocess signals appear after on-going forks */
> +	if (type > PIDTYPE_TGID) {
> +		struct multiprocess_signals *delayed;
> +		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
> +			sigaddset(&delayed->signal, sig);

This is not enough, I think...

Suppose you send SIGSTOP and then SIGCONT to some process group. The 1st SIGSTOP
will be queued correctly, but the 2nd SIGCONT won't flush the pending SIGSTOP, you
need to modify prepare_signal() too.

Oleg.
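
The SIGSTOP/SIGCONT problem above can be illustrated with POSIX signal
sets in userspace.  This is only a sketch: the model_* names are
hypothetical, and model_record_flushing() is one assumed shape for a fix,
loosely following how prepare_signal() discards stop signals when SIGCONT
is generated and discards SIGCONT when a stop signal is generated.  It is
not the change that was eventually applied.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sigset in struct multiprocess_signals. */
struct model_delayed {
	sigset_t signal;
};

static void model_init(struct model_delayed *d)
{
	assert(d);
	sigemptyset(&d->signal);
}

/* What the v3 hunk does: record the signal and nothing else. */
static void model_record_v3(struct model_delayed *d, int sig)
{
	sigaddset(&d->signal, sig);
}

/* The four job-control stop signals. */
static bool model_is_stop(int sig)
{
	return sig == SIGSTOP || sig == SIGTSTP ||
	       sig == SIGTTIN || sig == SIGTTOU;
}

/* Assumed fix: flush the opposing signals before recording. */
static void model_record_flushing(struct model_delayed *d, int sig)
{
	if (sig == SIGCONT) {
		/* SIGCONT cancels any stop signal collected so far. */
		sigdelset(&d->signal, SIGSTOP);
		sigdelset(&d->signal, SIGTSTP);
		sigdelset(&d->signal, SIGTTIN);
		sigdelset(&d->signal, SIGTTOU);
	} else if (model_is_stop(sig)) {
		/* A stop signal cancels a collected SIGCONT. */
		sigdelset(&d->signal, SIGCONT);
	}
	sigaddset(&d->signal, sig);
}
```

With the v3 behaviour, SIGSTOP followed by SIGCONT leaves both signals in
the delayed set, so the child would still see the stale SIGSTOP.  With the
flushing variant only SIGCONT remains.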


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v3 20/20] signal: Don't restart fork when signals come in.
  2018-07-26 13:41                                               ` Oleg Nesterov
@ 2018-07-26 14:42                                                 ` Eric W. Biederman
  2018-07-26 15:55                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-26 14:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/24, Eric W. Biederman wrote:
>>
>> @@ -1979,6 +1983,8 @@ static __latent_entropy struct task_struct *copy_process(
>>  			attach_pid(p, PIDTYPE_TGID);
>>  			attach_pid(p, PIDTYPE_PGID);
>>  			attach_pid(p, PIDTYPE_SID);
>> +			p->signal->shared_pending.signal = delayed.signal;
>
> Again, in this case we do not hold p->sighand->siglock (unless CLONE_SIGHAND),
> so this should be done before list_add_tail/attach_pid above. Plus we need some
> sort of barrier.
>
> Or we can do
>
> 	if (!CLONE_SIGHAND)
> 		spin_lock_nested(child->siglock);
>
> at the start of "if (likely(p->pid))" block.

Good point.  We want to exclude races between new signals coming in and
gathering the information from the old queued signals.

I will take a look.

>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -1123,6 +1123,15 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
>>  out_set:
>>  	signalfd_notify(t, sig);
>>  	sigaddset(&pending->signal, sig);
>> +
>> +	/* Let multiprocess signals appear after on-going forks */
>> +	if (type > PIDTYPE_TGID) {
>> +		struct multiprocess_signals *delayed;
>> +		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
>> +			sigaddset(&delayed->signal, sig);
>
> This is not enough, I think...
>
> Suppose you send SIGSTOP and then SIGCONT to some process group. The 1st SIGSTOP
> will be queued correctly, but the 2nd SIGCONT won't flush the pending SIGSTOP, you
> need to modify prepare_signal() too.

Good point.  We can't have both SIGCONT and a stop signal (SIGSTOP or
SIGTSTP) enqueued at the same time.  And there is something in the
prepare_signal code about parent notifications as well.

I will look up the fine points of SIGCONT and SIGSTOP interaction
and see what we should be doing here.

Sigh.  I thought this was going to be as simple as the sequence counter
but this looks a tad more complicated.

Are the earlier patches looking ok to you?

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 18/20] signal: Add calculate_sigpending()
  2018-07-26 13:20                             ` Oleg Nesterov
@ 2018-07-26 15:13                               ` Eric W. Biederman
  2018-07-26 16:24                                 ` Oleg Nesterov
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-07-26 15:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/23, Eric W. Biederman wrote:
>>
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1988,6 +1988,7 @@ static __latent_entropy struct task_struct *copy_process(
>>  					  &p->signal->thread_head);
>>  		}
>>  		attach_pid(p, PIDTYPE_PID);
>> +		calculate_sigpending(p);
>
> In theory this looks racy if !CLONE_SIGHAND, please see below
>
>> +void calculate_sigpending(struct task_struct *new)
>> +{
>> +	/* Have any signals or users of TIF_SIGPENDING been delayed
>> +	 * until after fork?
>> +	 */
>> +	bool pending = (new->jobctl & JOBCTL_PENDING_MASK) ||
>> +		PENDING(&new->pending, &new->blocked) ||
>> +		PENDING(&new->signal->shared_pending, &new->blocked) ||
>> +		freezing(new) || klp_patch_pending(new);
>
> note that we do not hold new->sighand->siglock, but this "new" task is already
> visible to find_task_by_vpid/etc; so a new signal can come right after
> this check,

Good point.  The location of the call to calculate_sigpending is wrong.

>> +	update_tsk_thread_flag(new, TIF_SIGPENDING, pending);
>
> and then update_tsk_thread_flag() can wrongly clear TIF_SIGPENDING.
>
> Easy to fix, but perhaps we can simply add recalc_sigpending() into
> schedule_tail() ? It already does more than just finish_task_switch/etc.
>
> This way we do not need the new helper (which btw can only be used by
> copy_process).

The problem I have with reusing recalc_sigpending is that it does not
set TIF_SIGPENDING if (freezing || klp_patch_pending).

There is obviously synergy between these two cases, I just have not
figured out how to take advantage of it yet.

> Note also that either way you can remove set_tsk_thread_flag(TIF_SIGPENDING)
> from ptrace_init_task().

Interesting.  Yes we can remove TIF_SIGPENDING from that case because
ptrace_init_task sets jobctl or queues a pending signal.  I like
that synergy.  I like not being able to miss setting TIF_SIGPENDING
during fork.

Eric


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v3 20/20] signal: Don't restart fork when signals come in.
  2018-07-26 14:42                                                 ` Eric W. Biederman
@ 2018-07-26 15:55                                                   ` Oleg Nesterov
  2018-08-09  6:19                                                     ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-26 15:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On 07/26, Eric W. Biederman wrote:
>
> Are the earlier patches looking ok to you?

I obviously like 1-15.

"[PATCH 16/20] fork: Move and describe why the code examines PIDNS_ADDING"
is "interesting". I mean it is fine, but at the end of this series it doesn't
matter what we check first, PIDNS_ADDING or fatal_signal_pending() - restart
is not possible in either case.


As for 17-20... Yes I am biased. But I still think the simple approach I tried
to propose from the very beginning is better. At least simpler, in that you do
not need to worry about all these special cases/reasons for signal_pending().




And you can not imagine how much I hate "[PATCH 19/20] fork: Have new threads
join on-going signal group stops" ;) Because I spent HOURS looking at this trivial
patch and I am still not sure...

To clarify, the CLONE_THREAD with JOBCTL_STOP_PENDING case is simple, I am mostly
worried about JOBCTL_TRAP_STOP/etc with or without CLONE_THREAD, this adds some
subtle changes but unfortunately I failed to find something wrong so I can't argue.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v3 20/20] signal: Don't restart fork when signals come in.
  2018-07-25  3:56                                             ` [PATCH v3 " Eric W. Biederman
  2018-07-26 13:41                                               ` Oleg Nesterov
@ 2018-07-26 16:21                                               ` Oleg Nesterov
  1 sibling, 0 replies; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-26 16:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On 07/24, Eric W. Biederman wrote:
>
> Similarly the current code will also miss blocked
> signals that are delivered to multiple process, as those signals will
> not appear pending during fork.

Well, I still think this needs a separate change and discussion...

Let me repeat, I simply do not know if this is good or not, I don't know
if the current behaviour is by design or if it is a mistake.

OK, I won't argue but note that your patch doesn't really fix the problem,

> +	spin_lock_irq(&current->sighand->siglock);
> +	if (!(clone_flags & CLONE_THREAD))
> +		hlist_add_head(&delayed.node, &current->signal->multiprocess);
> +	recalc_sigpending();
> +	spin_unlock_irq(&current->sighand->siglock);
> +	retval = -ERESTARTNOINTR;
> +	if (signal_pending(current))
> +		goto fork_out;

because recalc_sigpending() will not notice the blocked multiprocess signal
if it is already pending.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 18/20] signal: Add calculate_sigpending()
  2018-07-26 15:13                               ` Eric W. Biederman
@ 2018-07-26 16:24                                 ` Oleg Nesterov
  0 siblings, 0 replies; 96+ messages in thread
From: Oleg Nesterov @ 2018-07-26 16:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Wen Yang, majiang

On 07/26, Eric W. Biederman wrote:
>
> > Easy to fix, but perhaps we can simply add recalc_sigpending() into
> > schedule_tail() ? It already does more than just finish_task_switch/etc.
> >
> > This way we do not need the new helper (which btw can only be used by
> > copy_process).
>
> The problem I have with reusing recalc_sigpending is that it does not
> set TIF_SIGPENDING if (freezing || klp_patch_pending).

Ah, indeed, you are right.

Oleg.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v3 20/20] signal: Don't restart fork when signals come in.
  2018-07-26 15:55                                                   ` Oleg Nesterov
@ 2018-08-09  6:19                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Oleg Nesterov <oleg@redhat.com> writes:

> On 07/26, Eric W. Biederman wrote:
>>
>> Are the earlier patches looking ok to you?
>
> I obviously like 1-15.
>
> "[PATCH 16/20] fork: Move and describe why the code examines PIDNS_ADDING"
> is "interesting". I mean it is fine, but at the end of this series it doesn't
> matter what we check first, PIDNS_ADDING or fatal_signal_pending() - restart
> is not possible in both cases.
>
>
> As for 17-20... Yes I am biased. But I still think the simple approach I tried
> to propose from the very beginning is better. At least simpler, in that you do
> not need to worry about all these special cases/reasons for signal_pending().

I think worrying about them all now results in a future where we don't
have to worry about reasons why we can't let fork continue.  That gives a
better progress guarantee, which ultimately should be more maintainable
going forward.

> And you can not imagine how much I hate "[PATCH 19/20] fork: Have new threads
> join on-going signal group stops" ;) Because I spent HOURS looking at this trivial
> patch and I am still not sure...
>
> To clarify, the CLONE_THREAD with JOBCTL_STOP_PENDING case is simple, I am mostly
> worried about JOBCTL_TRAP_STOP/etc with or without CLONE_THREAD, this adds some
> subtle changes but unfortunately I failed to find something wrong so I
> can't argue

I can understand taking a hard look at JOBCTL_TRAP_STOP especially as it
gets mixed in with the multi-task (whole process) stop handling when at
least one of the tasks of a process is being ptraced.  To make certain
I understood your concern I took a second look at it myself.

The ptrace actions are defined to only affect a single task, and except
for multi-task stop handling all of the jobctl bits are used for ptrace
actions.  So I don't see how there is anything we could possibly miss
in the jobctl bits.

Eric



^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 0/6] Not restarting for due to signals.
  2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
                                             ` (21 preceding siblings ...)
  2018-07-25 16:05                           ` Oleg Nesterov
@ 2018-08-09  6:53                           ` Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 1/6] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
                                               ` (6 more replies)
  22 siblings, 7 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds


This builds on patches 1-15 of my previous patch posting.  As those are
non-controversial I am not posting them again.

I took longer than I had hoped to get this set together because a kernel
testing robot noticed some random corruption with the way I had been
adding to the list.  I finally tracked it down to failing to remove the
sigset from the list during fork_idle.  So I have made that logic
simpler and use hlist_del_init which will only remove an item from a
list if it was placed on the list in the first place.
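
As a rough userspace illustration of why hlist_del_init is safe to call on
every exit path, here is a simplified model of the hlist primitives.  The
model_ names are made up for this sketch, and this is not the kernel's
list code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel's hlist types. */
struct model_hlist_node {
	struct model_hlist_node *next;
	struct model_hlist_node **pprev;	/* NULL while not on a list */
};

struct model_hlist_head {
	struct model_hlist_node *first;
};

static void model_init_node(struct model_hlist_node *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

static bool model_unhashed(const struct model_hlist_node *n)
{
	return n->pprev == NULL;
}

static void model_add_head(struct model_hlist_node *n,
			   struct model_hlist_head *h)
{
	assert(model_unhashed(n));	/* adding twice would corrupt the list */
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink n if it is on a list; a no-op for a node that was never added. */
static void model_del_init(struct model_hlist_node *n)
{
	if (model_unhashed(n))
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	model_init_node(n);
}
```

With this shape a path that never added its node (a CLONE_THREAD fork for
example) can share the same cleanup call as a path that did.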

I took Oleg's suggestion and moved calculate_sigpending into
schedule_tail where recalc_sigpending can be used directly.  Then in
calculate_sigpending I just unconditionally set TIF_SIGPENDING and allow
recalc_sigpending to clear TIF_SIGPENDING if we don't need it.
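
A small userspace model of that ordering, assuming (as discussed earlier
in the thread) that recalc_sigpending only sets the flag for pending
signals or jobctl bits, and leaves it alone when the freezer or live
patching wants the slow path.  The struct and the model_ functions are
hypothetical and only mirror the shape of the logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened view of the state recalc_sigpending looks at. */
struct task_model {
	bool tif_sigpending;
	bool signal_or_jobctl_pending;
	bool freezing;
	bool klp_patch_pending;
};

/* Sets the flag for pending work; clears it only when nothing needs it. */
static void model_recalc_sigpending(struct task_model *t)
{
	if (t->signal_or_jobctl_pending)
		t->tif_sigpending = true;
	else if (!t->freezing && !t->klp_patch_pending)
		t->tif_sigpending = false;
}

/* Start from "pending" so the freezer and livepatch cases are not lost. */
static void model_calculate_sigpending(struct task_model *t)
{
	t->tif_sigpending = true;
	model_recalc_sigpending(t);
	/* The recalculation never clears the flag while work is pending. */
	assert(!t->signal_or_jobctl_pending || t->tif_sigpending);
}
```

In this model a freezing task that starts with the flag clear keeps it
clear after a bare recalculation, but ends up with it set when the flag is
set unconditionally first.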

I also now handle the stop/continue signal magic where we only let one
of stop signals and SIGCONT be pending at a time.  Looking at it from
first principles dropping one of SIGTSTP SIGTTIN SIGTTOU or SIGCONT
before calling its handler feels wrong.  I checked and it is our
historical behavior, so I won't even think of introducing different
behavior at this point.
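
The rule can be sketched in userspace with an ordinary sigset_t.  This is
only an illustration: the helper names are hypothetical, and treating
SIGSTOP, SIGTSTP, SIGTTIN and SIGTTOU as the set of stop signals is an
assumption of the sketch:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Assumed set of stop signals for this sketch. */
static bool model_is_stop_signal(int sig)
{
	return sig == SIGSTOP || sig == SIGTSTP ||
	       sig == SIGTTIN || sig == SIGTTOU;
}

/*
 * Record sig in a pending set while keeping stop signals and SIGCONT
 * mutually exclusive: whichever arrives last wins.
 */
static void model_record_signal(sigset_t *pending, int sig)
{
	assert(sig > 0);
	if (sig == SIGCONT) {
		/* A continue discards every stop that is still pending. */
		sigdelset(pending, SIGSTOP);
		sigdelset(pending, SIGTSTP);
		sigdelset(pending, SIGTTIN);
		sigdelset(pending, SIGTTOU);
	} else if (model_is_stop_signal(sig)) {
		/* A stop discards a continue that is still pending. */
		sigdelset(pending, SIGCONT);
	}
	sigaddset(pending, sig);
}
```

Signals outside the stop/continue family are simply added and leave the
rest of the set untouched.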

Eric W. Biederman (6):
      fork: Move and describe why the code examines PIDNS_ADDING
      fork: Unconditionally exit if a fatal signal is pending
      signal: Add calculate_sigpending()
      fork: Skip setting TIF_SIGPENDING in ptrace_init_task
      fork: Have new threads join on-going signal group stops
      signal: Don't restart fork when signals come in.

 include/linux/ptrace.h       |  2 --
 include/linux/sched/signal.h | 11 ++++++++++
 init/init_task.c             |  1 +
 kernel/fork.c                | 49 ++++++++++++++++++++++++++++++--------------
 kernel/sched/core.c          |  2 ++
 kernel/signal.c              | 43 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 91 insertions(+), 17 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 1/6] fork: Move and describe why the code examines PIDNS_ADDING
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
@ 2018-08-09  6:56                             ` Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 2/6] fork: Unconditionally exit if a fatal signal is pending Eric W. Biederman
                                               ` (5 subsequent siblings)
  6 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds,
	Eric W. Biederman

Normally this would be something that would be handled by handling
signals that are sent to a group of processes but in this case the
forking process is not a member of the group being signaled.  Thus
special code is needed to prevent a race with pid namespaces exiting,
and fork adding new processes within them.

Move this test up before the signal restart just in case signals are
also pending.  Fatal conditions should take precedence over restarts.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index cc5be0d01ce6..b9c54318a292 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1922,6 +1922,12 @@ static __latent_entropy struct task_struct *copy_process(
 
 	rseq_fork(p, clone_flags);
 
+	/* Don't start children in a dying pid namespace */
+	if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
+		retval = -ENOMEM;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
@@ -1935,10 +1941,7 @@ static __latent_entropy struct task_struct *copy_process(
 		retval = -ERESTARTNOINTR;
 		goto bad_fork_cancel_cgroup;
 	}
-	if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
-		retval = -ENOMEM;
-		goto bad_fork_cancel_cgroup;
-	}
+
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 2/6] fork: Unconditionally exit if a fatal signal is pending
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 1/6] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
@ 2018-08-09  6:56                             ` Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 3/6] signal: Add calculate_sigpending() Eric W. Biederman
                                               ` (4 subsequent siblings)
  6 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds,
	Eric W. Biederman

In practice this does not change anything as testing for fatal_signal_pending
and exiting fork with an error code duplicates the work of the next clause
which recalculates pending signals and then exits fork if any are pending.
In both cases the pending signal will trigger the slow path when exiting
to userspace, and the fatal signal will cause do_exit to be called.

The advantage of making this a separate test is that it makes it clear
processing the fatal signal will terminate the fork, and it allows the
rest of the signal logic to be updated without fear that this important
case will be lost.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9c54318a292..22d4cdb9a7ca 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1928,6 +1928,12 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	/* Let kill terminate clone/fork in the middle */
+	if (fatal_signal_pending(current)) {
+		retval = -EINTR;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/*
 	 * Process group and session signals need to be delivered to just the
 	 * parent before the fork or both the parent and the child after the
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 3/6] signal: Add calculate_sigpending()
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 1/6] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 2/6] fork: Unconditionally exit if a fatal signal is pending Eric W. Biederman
@ 2018-08-09  6:56                             ` Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 4/6] fork: Skip setting TIF_SIGPENDING in ptrace_init_task Eric W. Biederman
                                               ` (3 subsequent siblings)
  6 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds,
	Eric W. Biederman

Add a function calculate_sigpending to test to see if any signals are
pending for a new task immediately following fork.  Signals have to
happen either before or after fork.  Today our practice is to push
all of the signals to before the fork, but that has the downside that
frequent or periodic signals can make fork take much much longer than
normal or prevent fork from completing entirely.

So we need to move the signals that we can to after the fork to prevent that.

This updates the code to set TIF_SIGPENDING on a new task if there
are signals or other activities that have moved so that they appear
to happen after the fork.

As the code today restarts if it sees any such activity this won't
immediately have an effect, as there will be no reason for it
to set TIF_SIGPENDING immediately after the fork.

Adding calculate_sigpending means the code in fork can safely be
changed to not always restart if a signal is pending.

The new calculate_sigpending function sets TIF_SIGPENDING if there
are pending bits in jobctl, pending signals, the freezer needs
to freeze the new task, or the live kernel patching framework
needs the new thread to take the slow path to userspace.

I have verified that setting TIF_SIGPENDING does make a new process
take the slow path to userspace before it executes its first userspace
instruction.

I have looked at the callers of signal_wake_up and the code paths
setting TIF_SIGPENDING and I don't see anything else that needs to be
handled.  The code probably doesn't need to set TIF_SIGPENDING for the
kernel live patching as it uses a separate thread flag as well.  But
at this point it seems safer to reuse the recalc_sigpending logic and get
the kernel live patching folks to sort out their story later.

V2: I have moved the test into schedule_tail where siglock can
    be grabbed and recalc_sigpending can be reused directly.
    Further as the last action of setting up a new task this
    guarantees that TIF_SIGPENDING will be properly set in the
    new process.

    The helper calculate_sigpending takes the siglock and
    unconditionally sets TIF_SIGPENDING and lets recalc_sigpending
    clear TIF_SIGPENDING if it is unnecessary.  This allows reusing
    the existing code and keeps maintenance of the conditions simple.

    Oleg Nesterov <oleg@redhat.com>  suggested the movement
    and pointed out the need to take siglock if this code
    was going to be called while the new task is discoverable.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  1 +
 kernel/sched/core.c          |  2 ++
 kernel/signal.c              | 11 +++++++++++
 3 files changed, 14 insertions(+)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 94558ffa82ab..b55fd293c1e5 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -372,6 +372,7 @@ static inline int signal_pending_state(long state, struct task_struct *p)
  */
 extern void recalc_sigpending_and_wake(struct task_struct *t);
 extern void recalc_sigpending(void);
+extern void calculate_sigpending(void);
 
 extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8facba456..3e4ed4b7aa2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2813,6 +2813,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 
 	if (current->set_child_tid)
 		put_user(task_pid_vnr(current), current->set_child_tid);
+
+	calculate_sigpending();
 }
 
 /*
diff --git a/kernel/signal.c b/kernel/signal.c
index dddbea558455..1e06f1eba363 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -172,6 +172,17 @@ void recalc_sigpending(void)
 
 }
 
+void calculate_sigpending(void)
+{
+	/* Have any signals or users of TIF_SIGPENDING been delayed
+	 * until after fork?
+	 */
+	spin_lock_irq(&current->sighand->siglock);
+	set_tsk_thread_flag(current, TIF_SIGPENDING);
+	recalc_sigpending();
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /* Given the mask, find the first available signal that should be serviced. */
 
 #define SYNCHRONOUS_MASK \
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 4/6] fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
                                               ` (2 preceding siblings ...)
  2018-08-09  6:56                             ` [PATCH v5 3/6] signal: Add calculate_sigpending() Eric W. Biederman
@ 2018-08-09  6:56                             ` Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 5/6] fork: Have new threads join on-going signal group stops Eric W. Biederman
                                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds,
	Eric W. Biederman

The code in calculate_sigpending will now handle this so
it is just redundant and possibly a little confusing
to continue setting TIF_SIGPENDING in ptrace_init_task.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/ptrace.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 037bf0ef1ae9..4f36431c380b 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -214,8 +214,6 @@ static inline void ptrace_init_task(struct task_struct *child, bool ptrace)
 			task_set_jobctl_pending(child, JOBCTL_TRAP_STOP);
 		else
 			sigaddset(&child->pending.signal, SIGSTOP);
-
-		set_tsk_thread_flag(child, TIF_SIGPENDING);
 	}
 	else
 		child->ptracer_cred = NULL;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 5/6] fork: Have new threads join on-going signal group stops
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
                                               ` (3 preceding siblings ...)
  2018-08-09  6:56                             ` [PATCH v5 4/6] fork: Skip setting TIF_SIGPENDING in ptrace_init_task Eric W. Biederman
@ 2018-08-09  6:56                             ` Eric W. Biederman
  2018-08-09  6:56                             ` [PATCH v5 6/6] signal: Don't restart fork when signals come in Eric W. Biederman
  2018-08-09 17:16                             ` [PATCH v5 0/6] Not restarting for due to signals Linus Torvalds
  6 siblings, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds,
	Eric W. Biederman

There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL.  Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered.  Which
leaves only SIGSTOP that needs to be considered when creating new
threads.

Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.

A fork (especially of a process with a lot of memory) is one of the
most expensive system calls, so we really only want to restart a fork when
necessary.

It is easy to check whether a SIGSTOP is ongoing and have the new
thread join it immediately after the clone completes, making it appear
that the clone completed just before the SIGSTOP.

The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.

V2: The call to task_join_group_stop was moved before the new task is
    added to the thread group list.  This should not matter as
    sighand->siglock is held over both the addition of the threads,
    the call to task_join_group_stop and do_signal_stop.  But the change
    is trivial and it is one less thing to worry about when reading
    the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  2 ++
 kernel/fork.c                | 27 +++++++++++++++------------
 kernel/signal.c              | 14 ++++++++++++++
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b55fd293c1e5..ae2b0b81be25 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -385,6 +385,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 	signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
 }
 
+void task_join_group_stop(struct task_struct *task);
+
 #ifdef TIF_RESTORE_SIGMASK
 /*
  * Legacy restore_sigmask accessors.  These are inefficient on
diff --git a/kernel/fork.c b/kernel/fork.c
index 22d4cdb9a7ca..ab731e15a600 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1934,18 +1934,20 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	/*
-	 * Process group and session signals need to be delivered to just the
-	 * parent before the fork or both the parent and the child after the
-	 * fork. Restart if a signal comes in before we add the new process to
-	 * it's process group.
-	 * A fatal signal pending means that current will exit, so the new
-	 * thread can't slip out of an OOM kill (or normal SIGKILL).
-	*/
-	recalc_sigpending();
-	if (signal_pending(current)) {
-		retval = -ERESTARTNOINTR;
-		goto bad_fork_cancel_cgroup;
+	if (!(clone_flags & CLONE_THREAD)) {
+		/*
+		 * Process group and session signals need to be delivered to just the
+		 * parent before the fork or both the parent and the child after the
+		 * fork. Restart if a signal comes in before we add the new process to
+		 * it's process group.
+		 * A fatal signal pending means that current will exit, so the new
+		 * thread can't slip out of an OOM kill (or normal SIGKILL).
+		 */
+		recalc_sigpending();
+		if (signal_pending(current)) {
+			retval = -ERESTARTNOINTR;
+			goto bad_fork_cancel_cgroup;
+		}
 	}
 
 
@@ -1982,6 +1984,7 @@ static __latent_entropy struct task_struct *copy_process(
 			current->signal->nr_threads++;
 			atomic_inc(&current->signal->live);
 			atomic_inc(&current->signal->sigcnt);
+			task_join_group_stop(p);
 			list_add_tail_rcu(&p->thread_group,
 					  &p->group_leader->thread_group);
 			list_add_tail_rcu(&p->thread_node,
diff --git a/kernel/signal.c b/kernel/signal.c
index 1e06f1eba363..9f0eafb6d474 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -373,6 +373,20 @@ static bool task_participate_group_stop(struct task_struct *task)
 	return false;
 }
 
+void task_join_group_stop(struct task_struct *task)
+{
+	/* Have the new thread join an on-going signal group stop */
+	unsigned long jobctl = current->jobctl;
+	if (jobctl & JOBCTL_STOP_PENDING) {
+		struct signal_struct *sig = current->signal;
+		unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK;
+		unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
+		if (task_set_jobctl_pending(task, signr | gstop)) {
+			sig->group_stop_count++;
+		}
+	}
+}
+
 /*
  * allocate a new signal queue record
  * - this may be called without locks if and only if t == current, otherwise an
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v5 6/6] signal: Don't restart fork when signals come in.
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
                                               ` (4 preceding siblings ...)
  2018-08-09  6:56                             ` [PATCH v5 5/6] fork: Have new threads join on-going signal group stops Eric W. Biederman
@ 2018-08-09  6:56                             ` Eric W. Biederman
  2018-08-09 17:15                               ` Linus Torvalds
  2018-08-09 17:16                             ` [PATCH v5 0/6] Not restarting for due to signals Linus Torvalds
  6 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09  6:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, Wen Yang, majiang, Linus Torvalds,
	Eric W. Biederman

Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and its newly spawned child.  For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to be logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple processes, as those signals will
not appear pending during fork.

Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes, initialize the new process's
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals.  Making it
appear as if those signals were received immediately after the fork.
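
A minimal userspace model of that bookkeeping, under the assumption that
each in-progress fork owns one delayed signal set linked from its parent.
All struct and function names here are hypothetical, locking is left out,
and the real code lives in copy_process and __send_signal:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* One entry per fork in progress (models a per-fork delayed set). */
struct delayed_model {
	sigset_t signal;
	struct delayed_model *next;
};

struct process_model {
	sigset_t shared_pending;
	struct delayed_model *forking;	/* forks currently in progress */
};

static void model_fork_begin(struct process_model *parent,
			     struct delayed_model *d)
{
	sigemptyset(&d->signal);
	d->next = parent->forking;
	parent->forking = d;
}

/* A signal sent to multiple processes is also noted for each child-to-be. */
static void model_send_multiprocess(struct process_model *p, int sig)
{
	struct delayed_model *d;

	sigaddset(&p->shared_pending, sig);
	for (d = p->forking; d; d = d->next)
		sigaddset(&d->signal, sig);
}

/* A signal aimed at one process only never reaches the delayed sets. */
static void model_send_single(struct process_model *p, int sig)
{
	sigaddset(&p->shared_pending, sig);
}

static void model_fork_finish(struct process_model *parent,
			      struct delayed_model *d,
			      struct process_model *child)
{
	struct delayed_model **pp = &parent->forking;

	while (*pp && *pp != d)
		pp = &(*pp)->next;
	assert(*pp == d);		/* d must have been registered */
	*pp = d->next;

	/* The child starts life with the delayed signals pending. */
	child->shared_pending = d->signal;
	child->forking = NULL;
}
```

In this model a periodic timer signal sent with model_send_single never
shows up in the child, while a process group style signal sent with
model_send_multiprocess is pending in both parent and child once the fork
finishes.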

It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that there are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.

V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
    - Initialize signal_struct.multiprocess in init_task
    - Move setting of shared_pending to before the new task
      is visible to signals.  This prevents signals from coming
      in before shared_pending.signal is set to delayed.signal
      and being lost.
V5: - rework list add and delete to account for idle threads

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn>
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  8 +++++++
 init/init_task.c             |  1 +
 kernel/fork.c                | 43 +++++++++++++++++++++---------------
 kernel/signal.c              | 18 +++++++++++++++
 4 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ae2b0b81be25..4e9b77fb702d 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -69,6 +69,11 @@ struct thread_group_cputimer {
 	bool checking_timer;
 };
 
+struct multiprocess_signals {
+	sigset_t signal;
+	struct hlist_node node;
+};
+
 /*
  * NOTE! "signal_struct" does not have its own
  * locking, because a shared signal_struct always
@@ -90,6 +95,9 @@ struct signal_struct {
 	/* shared signal handling: */
 	struct sigpending	shared_pending;
 
+	/* For collecting multiprocess signals during fork */
+	struct hlist_head	multiprocess;
+
 	/* thread group exit support */
 	int			group_exit_code;
 	/* overloaded:
diff --git a/init/init_task.c b/init/init_task.c
index 4f97846256d7..5aebe3be4d7c 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,6 +22,7 @@ static struct signal_struct init_signals = {
 		.list = LIST_HEAD_INIT(init_signals.shared_pending.list),
 		.signal =  {{0}}
 	},
+	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/fork.c b/kernel/fork.c
index ab731e15a600..411e34acace7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1456,6 +1456,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	init_waitqueue_head(&sig->wait_chldexit);
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
+	INIT_HLIST_HEAD(&sig->multiprocess);
 	seqlock_init(&sig->stats_lock);
 	prev_cputime_init(&sig->prev_cputime);
 
@@ -1602,6 +1603,7 @@ static __latent_entropy struct task_struct *copy_process(
 {
 	int retval;
 	struct task_struct *p;
+	struct multiprocess_signals delayed;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1649,6 +1651,24 @@ static __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * Force any signals received before this point to be delivered
+	 * before the fork happens.  Collect up signals sent to multiple
+	 * processes that happen during the fork and delay them so that
+	 * they appear to happen after the fork.
+	 */
+	sigemptyset(&delayed.signal);
+	INIT_HLIST_NODE(&delayed.node);
+
+	spin_lock_irq(&current->sighand->siglock);
+	if (!(clone_flags & CLONE_THREAD))
+		hlist_add_head(&delayed.node, &current->signal->multiprocess);
+	recalc_sigpending();
+	spin_unlock_irq(&current->sighand->siglock);
+	retval = -ERESTARTNOINTR;
+	if (signal_pending(current))
+		goto fork_out;
+
 	retval = -ENOMEM;
 	p = dup_task_struct(current, node);
 	if (!p)
@@ -1934,22 +1954,6 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	if (!(clone_flags & CLONE_THREAD)) {
-		/*
-		 * Process group and session signals need to be delivered to just the
-		 * parent before the fork or both the parent and the child after the
-		 * fork. Restart if a signal comes in before we add the new process to
-		 * it's process group.
-		 * A fatal signal pending means that current will exit, so the new
-		 * thread can't slip out of an OOM kill (or normal SIGKILL).
-		 */
-		recalc_sigpending();
-		if (signal_pending(current)) {
-			retval = -ERESTARTNOINTR;
-			goto bad_fork_cancel_cgroup;
-		}
-	}
-
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
@@ -1965,7 +1969,7 @@ static __latent_entropy struct task_struct *copy_process(
 				ns_of_pid(pid)->child_reaper = p;
 				p->signal->flags |= SIGNAL_UNKILLABLE;
 			}
-
+			p->signal->shared_pending.signal = delayed.signal;
 			p->signal->tty = tty_kref_get(current->signal->tty);
 			/*
 			 * Inherit has_child_subreaper flag under the same
@@ -1993,8 +1997,8 @@ static __latent_entropy struct task_struct *copy_process(
 		attach_pid(p, PIDTYPE_PID);
 		nr_threads++;
 	}
-
 	total_forks++;
+	hlist_del_init(&delayed.node);
 	spin_unlock(&current->sighand->siglock);
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
@@ -2059,6 +2063,9 @@ static __latent_entropy struct task_struct *copy_process(
 	put_task_stack(p);
 	free_task(p);
 fork_out:
+	spin_lock_irq(&current->sighand->siglock);
+	hlist_del_init(&delayed.node);
+	spin_unlock_irq(&current->sighand->siglock);
 	return ERR_PTR(retval);
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 9f0eafb6d474..850885db2c1e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1121,6 +1121,24 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
+
+	/* Let multiprocess signals appear after on-going forks */
+	if (type > PIDTYPE_TGID) {
+		struct multiprocess_signals *delayed;
+		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
+			sigset_t *signal = &delayed->signal;
+			/* Can't queue both a stop and a continue signal */
+			if (sig == SIGCONT) {
+				sigset_t flush;
+				siginitset(&flush, SIG_KERNEL_STOP_MASK);
+				sigandnsets(signal, signal, &flush);
+			} else if (sig_kernel_stop(sig)) {
+				sigdelset(signal, SIGCONT);
+			}
+			sigaddset(signal, sig);
+		}
+	}
+
 	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 6/6] signal: Don't restart fork when signals come in.
  2018-08-09  6:56                             ` [PATCH v5 6/6] signal: Don't restart fork when signals come in Eric W. Biederman
@ 2018-08-09 17:15                               ` Linus Torvalds
  2018-08-09 17:42                                 ` Eric W. Biederman
  2018-08-09 18:09                                 ` [PATCH v6 " Eric W. Biederman
  0 siblings, 2 replies; 96+ messages in thread
From: Linus Torvalds @ 2018-08-09 17:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Wed, Aug 8, 2018 at 11:57 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The code was being overly pesimistic.

Pessimistic.

> +       if (type > PIDTYPE_TGID) {
> +               struct multiprocess_signals *delayed;
> +               hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
> +                       sigset_t *signal = &delayed->signal;
> +                       /* Can't queue both a stop and a continue signal */
> +                       if (sig == SIGCONT) {
> +                               sigset_t flush;
> +                               siginitset(&flush, SIG_KERNEL_STOP_MASK);
> +                               sigandnsets(signal, signal, &flush);

This looks odd and unnecessary.

Why isn't this just a

        sigdelsetmask(signal, SIG_KERNEL_STOP_MASK);

since all of the traditional stop bits should be in the low mask.

I see that we apparently have this stupid pattern elsewhere too, and
it looks like it's because we stupidly say "are the RT signals in the
non-legacy set", when that definitely cannot be the case for the (very
much legacy) tty flow control signals.
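
The equivalence being argued here can be checked with a small userspace
model.  This is only a sketch under stated assumptions: the two-word
struct model_sigset, the M_* signal numbers (the common x86 values) and
the helper names are hypothetical stand-ins, not the kernel's definitions.
It compares the "build a full set, then and-not" form against clearing
the bits directly in the low word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for sigset_t: two 32-bit words, signals 1..64. */
#define MODEL_WORDS 2
struct model_sigset { uint32_t sig[MODEL_WORDS]; };

/* Assumed signal numbers (x86 values); other architectures differ. */
enum { M_SIGSTOP = 19, M_SIGTSTP = 20, M_SIGTTIN = 21, M_SIGTTOU = 22 };

#define M_MASK(sig) ((uint32_t)1 << ((sig) - 1))
#define M_STOP_MASK (M_MASK(M_SIGSTOP) | M_MASK(M_SIGTSTP) | \
		     M_MASK(M_SIGTTIN) | M_MASK(M_SIGTTOU))

static void model_sigaddset(struct model_sigset *set, int sig)
{
	assert(sig >= 1 && sig <= 32 * MODEL_WORDS);
	set->sig[(sig - 1) / 32] |= (uint32_t)1 << ((sig - 1) % 32);
}

static int model_sigismember(const struct model_sigset *set, int sig)
{
	assert(sig >= 1 && sig <= 32 * MODEL_WORDS);
	return (set->sig[(sig - 1) / 32] >> ((sig - 1) % 32)) & 1;
}

/* v5 shape: initialise a whole temporary set, then and-not every word. */
static void clear_stops_full_set(struct model_sigset *set)
{
	struct model_sigset flush = { { M_STOP_MASK, 0 } };

	for (int i = 0; i < MODEL_WORDS; i++)
		set->sig[i] &= ~flush.sig[i];
}

/* v6 shape: all stop bits live in word 0, so only that word is touched. */
static void clear_stops_low_word(struct model_sigset *set)
{
	set->sig[0] &= ~M_STOP_MASK;
}
```

Both helpers leave the upper word untouched, because the temporary set
has no bits there; the second form just skips the extra work.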

                 Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 0/6] Not restarting for due to signals.
  2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
                                               ` (5 preceding siblings ...)
  2018-08-09  6:56                             ` [PATCH v5 6/6] signal: Don't restart fork when signals come in Eric W. Biederman
@ 2018-08-09 17:16                             ` Linus Torvalds
  6 siblings, 0 replies; 96+ messages in thread
From: Linus Torvalds @ 2018-08-09 17:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

On Wed, Aug 8, 2018 at 11:53 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> This builds on patches 1-15 of my previous patch posting.  As those are
> non-controversial I am not posting them again.

Other than the one nit I just replied about, this looks fine to me.

              Linus

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 6/6] signal: Don't restart fork when signals come in.
  2018-08-09 17:15                               ` Linus Torvalds
@ 2018-08-09 17:42                                 ` Eric W. Biederman
  2018-08-09 18:09                                 ` [PATCH v6 " Eric W. Biederman
  1 sibling, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09 17:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Aug 8, 2018 at 11:57 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The code was being overly pesimistic.
>
> Pessimistic.

>> +       if (type > PIDTYPE_TGID) {
>> +               struct multiprocess_signals *delayed;
>> +               hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
>> +                       sigset_t *signal = &delayed->signal;
>> +                       /* Can't queue both a stop and a continue signal */
>> +                       if (sig == SIGCONT) {
>> +                               sigset_t flush;
>> +                               siginitset(&flush, SIG_KERNEL_STOP_MASK);
>> +                               sigandnsets(signal, signal, &flush);
>
> This looks odd and unnecessary.
>
> Why isn't this just a
>
>         sigdelsetmask(signal, SIG_KERNEL_STOP_MASK);
>
> since all of the traditional stop bits should be in the low mask.
>
> I see that we apparently have this stupid pattern elsewhere too, and
> it looks like it's because we stupidly say "are the RT signals in the
> non-legacy set", when that definitely cannot be the case for the (very
> much legacy) tty flow control signals.

I just missed the existence of sigdelsetmask when I was putting this
together.

I will fix that and unless someone sees an issue I will queue this up
for linux-next.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v6 6/6] signal: Don't restart fork when signals come in.
  2018-08-09 17:15                               ` Linus Torvalds
  2018-08-09 17:42                                 ` Eric W. Biederman
@ 2018-08-09 18:09                                 ` Eric W. Biederman
  1 sibling, 0 replies; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-09 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List,
	Wen Yang, majiang


Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.

The code was being overly pessimistic.  Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and its newly spawned child.  For
signals like periodic timers that are always delivered to a single
process, fork can safely complete and let them appear to be logically
delivered after the fork().

While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread.  Similarly the current code will also miss blocked
signals that are delivered to multiple processes, as those signals will
not appear pending during fork.

Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes.  When fork completes initialize the new process's
shared_pending signal set with it.  The calculate_sigpending function
will see those signals and set TIF_SIGPENDING, causing the new task to
take the slow path to userspace to handle those signals, making it
appear as if those signals were received immediately after the fork.

It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that there are
no signals sent to multiple processes that require siginfo.  This
means it is safe to not bother collecting siginfo on signals sent
during fork.

The sigaction of a child of fork is initially the same as the
sigaction of the parent process.  So a signal the parent ignores the
child will also initially ignore.  Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.

Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with.  They won't cause any problems.
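
The bookkeeping described above can be modelled in userspace roughly as
follows.  This is a sketch under stated assumptions, not the patch itself:
the model_* types and functions, the 64-bit bitmask and the M_* signal
numbers (x86 values) are hypothetical, locking is omitted (the kernel does
this under siglock), and neither the restart check at the start of fork
nor the parent's own stop/continue flushing is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed signal numbers (x86 values); other architectures differ. */
enum { M_SIGINT = 2, M_SIGCONT = 18, M_SIGSTOP = 19,
       M_SIGTSTP = 20, M_SIGTTIN = 21, M_SIGTTOU = 22 };

#define M_BIT(sig) ((uint64_t)1 << ((sig) - 1))
#define M_STOP_MASK (M_BIT(M_SIGSTOP) | M_BIT(M_SIGTSTP) | \
		     M_BIT(M_SIGTTIN) | M_BIT(M_SIGTTOU))

/* One entry per in-flight fork, playing the role of multiprocess_signals. */
struct model_delayed {
	uint64_t signal;		/* signals collected during the fork */
	struct model_delayed *next;	/* the kernel uses an hlist_node */
};

struct model_process {
	uint64_t shared_pending;
	struct model_delayed *forking;	/* role of signal_struct.multiprocess */
};

static void model_fork_begin(struct model_process *parent,
			     struct model_delayed *d)
{
	d->signal = 0;
	d->next = parent->forking;
	parent->forking = d;
}

/* A signal sent to a process group or session that includes @p. */
static void model_send_multiprocess(struct model_process *p, int sig)
{
	assert(sig >= 1 && sig <= 64);
	p->shared_pending |= M_BIT(sig);

	for (struct model_delayed *d = p->forking; d; d = d->next) {
		/* Can't queue both a stop and a continue signal. */
		if (sig == M_SIGCONT)
			d->signal &= ~M_STOP_MASK;
		else if (M_BIT(sig) & M_STOP_MASK)
			d->signal &= ~M_BIT(M_SIGCONT);
		d->signal |= M_BIT(sig);
	}
}

/* The child starts with exactly the signals that arrived mid-fork. */
static void model_fork_end(struct model_process *parent,
			   struct model_delayed *d,
			   struct model_process *child)
{
	child->shared_pending = d->signal;
	child->forking = NULL;

	for (struct model_delayed **pp = &parent->forking; *pp;
	     pp = &(*pp)->next) {
		if (*pp == d) {
			*pp = d->next;
			break;
		}
	}
	d->next = NULL;
}
```

In this model a signal sent before model_fork_begin() reaches only the
parent, while one sent between begin and end reaches both, which is the
ordering guarantee the description asks for.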

V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
    - Initialize signal_struct.multiprocess in init_task
    - Move setting of shared_pending to before the new task
      is visible to signals.  This prevents signals from coming
      in before shared_pending.signal is set to delayed.signal
      and being lost.
V5: - rework list add and delete to account for idle threads
V6: - Use sigdelsetmask when removing stop signals

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn>
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h |  8 +++++++
 init/init_task.c             |  1 +
 kernel/fork.c                | 43 +++++++++++++++++++++---------------
 kernel/signal.c              | 15 +++++++++++++
 4 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ae2b0b81be25..4e9b77fb702d 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -69,6 +69,11 @@ struct thread_group_cputimer {
 	bool checking_timer;
 };
 
+struct multiprocess_signals {
+	sigset_t signal;
+	struct hlist_node node;
+};
+
 /*
  * NOTE! "signal_struct" does not have its own
  * locking, because a shared signal_struct always
@@ -90,6 +95,9 @@ struct signal_struct {
 	/* shared signal handling: */
 	struct sigpending	shared_pending;
 
+	/* For collecting multiprocess signals during fork */
+	struct hlist_head	multiprocess;
+
 	/* thread group exit support */
 	int			group_exit_code;
 	/* overloaded:
diff --git a/init/init_task.c b/init/init_task.c
index 4f97846256d7..5aebe3be4d7c 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,6 +22,7 @@ static struct signal_struct init_signals = {
 		.list = LIST_HEAD_INIT(init_signals.shared_pending.list),
 		.signal =  {{0}}
 	},
+	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/fork.c b/kernel/fork.c
index ab731e15a600..411e34acace7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1456,6 +1456,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	init_waitqueue_head(&sig->wait_chldexit);
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
+	INIT_HLIST_HEAD(&sig->multiprocess);
 	seqlock_init(&sig->stats_lock);
 	prev_cputime_init(&sig->prev_cputime);
 
@@ -1602,6 +1603,7 @@ static __latent_entropy struct task_struct *copy_process(
 {
 	int retval;
 	struct task_struct *p;
+	struct multiprocess_signals delayed;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1649,6 +1651,24 @@ static __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	/*
+	 * Force any signals received before this point to be delivered
+	 * before the fork happens.  Collect up signals sent to multiple
+	 * processes that happen during the fork and delay them so that
+	 * they appear to happen after the fork.
+	 */
+	sigemptyset(&delayed.signal);
+	INIT_HLIST_NODE(&delayed.node);
+
+	spin_lock_irq(&current->sighand->siglock);
+	if (!(clone_flags & CLONE_THREAD))
+		hlist_add_head(&delayed.node, &current->signal->multiprocess);
+	recalc_sigpending();
+	spin_unlock_irq(&current->sighand->siglock);
+	retval = -ERESTARTNOINTR;
+	if (signal_pending(current))
+		goto fork_out;
+
 	retval = -ENOMEM;
 	p = dup_task_struct(current, node);
 	if (!p)
@@ -1934,22 +1954,6 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
-	if (!(clone_flags & CLONE_THREAD)) {
-		/*
-		 * Process group and session signals need to be delivered to just the
-		 * parent before the fork or both the parent and the child after the
-		 * fork. Restart if a signal comes in before we add the new process to
-		 * it's process group.
-		 * A fatal signal pending means that current will exit, so the new
-		 * thread can't slip out of an OOM kill (or normal SIGKILL).
-		 */
-		recalc_sigpending();
-		if (signal_pending(current)) {
-			retval = -ERESTARTNOINTR;
-			goto bad_fork_cancel_cgroup;
-		}
-	}
-
 
 	init_task_pid_links(p);
 	if (likely(p->pid)) {
@@ -1965,7 +1969,7 @@ static __latent_entropy struct task_struct *copy_process(
 				ns_of_pid(pid)->child_reaper = p;
 				p->signal->flags |= SIGNAL_UNKILLABLE;
 			}
-
+			p->signal->shared_pending.signal = delayed.signal;
 			p->signal->tty = tty_kref_get(current->signal->tty);
 			/*
 			 * Inherit has_child_subreaper flag under the same
@@ -1993,8 +1997,8 @@ static __latent_entropy struct task_struct *copy_process(
 		attach_pid(p, PIDTYPE_PID);
 		nr_threads++;
 	}
-
 	total_forks++;
+	hlist_del_init(&delayed.node);
 	spin_unlock(&current->sighand->siglock);
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
@@ -2059,6 +2063,9 @@ static __latent_entropy struct task_struct *copy_process(
 	put_task_stack(p);
 	free_task(p);
 fork_out:
+	spin_lock_irq(&current->sighand->siglock);
+	hlist_del_init(&delayed.node);
+	spin_unlock_irq(&current->sighand->siglock);
 	return ERR_PTR(retval);
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 9f0eafb6d474..cfa9d10e731a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1121,6 +1121,21 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
+
+	/* Let multiprocess signals appear after on-going forks */
+	if (type > PIDTYPE_TGID) {
+		struct multiprocess_signals *delayed;
+		hlist_for_each_entry(delayed, &t->signal->multiprocess, node) {
+			sigset_t *signal = &delayed->signal;
+			/* Can't queue both a stop and a continue signal */
+			if (sig == SIGCONT)
+				sigdelsetmask(signal, SIG_KERNEL_STOP_MASK);
+			else if (sig_kernel_stop(sig))
+				sigdelset(signal, SIGCONT);
+			sigaddset(signal, sig);
+		}
+	}
+
 	complete_signal(sig, t, type);
 ret:
 	trace_signal_generate(sig, info, t, type != PIDTYPE_PID, result);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH] signal: Don't send signals to tasks that don't exist
  2018-07-24  3:24                           ` [PATCH 07/20] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
@ 2018-08-16  4:04                             ` Eric W. Biederman
  2018-08-17 17:34                               ` Dmitry Vyukov
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-16  4:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, linux-kernel, Wen Yang, majiang,
	J. Bruce Fields, syzkaller-bugs


Recently syzbot reported crashes in send_sigio_to_task and
send_sigurg_to_task in linux-next.  Despite finding a reproducer
syzbot apparently did not bisect this or otherwise track down the
offending commit in linux-next.

I happened to see this report and examined the code because I had
recently changed these functions as part of making PIDTYPE_TGID a real
pid type so that fork does not need to restart when receiving a
signal.  On examination I spotted a bug in the code
that could explain the reported crashes.

When I took Oleg's suggestion and optimized send_sigurg and send_sigio
to only send to a single task when type is PIDTYPE_PID or PIDTYPE_TGID
I failed to handle pids that no longer point to tasks.  The macro
do_each_pid_task simply iterates for zero iterations.  With pid_task
an explicit NULL test is needed.

Update the code to include the missing NULL test.
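
The difference between the two lookup styles can be illustrated with a
small userspace model.  This is a sketch under stated assumptions: the
model_* types and functions are hypothetical stand-ins for struct pid,
pid_task() and the do_each_pid_task loop, and only mirror their
behaviour for a pid with no tasks attached.

```c
#include <assert.h>
#include <stddef.h>

struct model_task {
	int signals_received;
	struct model_task *next;
};

/* Hypothetical stand-in for struct pid: a possibly empty list of tasks. */
struct model_pid {
	struct model_task *tasks;
};

/* Like pid_task(): the first attached task, or NULL if there is none. */
static struct model_task *model_pid_task(const struct model_pid *pid)
{
	return pid ? pid->tasks : NULL;
}

static void model_deliver(struct model_task *t)
{
	assert(t != NULL);	/* the unguarded call would trip here */
	t->signals_received++;
}

/* Single-target path: the lookup can fail, so test before delivering. */
static int model_send_single(const struct model_pid *pid)
{
	struct model_task *p = model_pid_task(pid);

	if (!p)
		return 0;
	model_deliver(p);
	return 1;
}

/* Iterating path: an empty pid just means the loop body never runs. */
static int model_send_each(const struct model_pid *pid)
{
	int delivered = 0;

	for (struct model_task *t = pid ? pid->tasks : NULL; t; t = t->next) {
		model_deliver(t);
		delivered++;
	}
	return delivered;
}
```

Both paths report zero deliveries for a pid whose task has exited; only
the single-target path needs the explicit test to get there safely.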

Fixes: 019191342fec ("signal: Use PIDTYPE_TGID to clearly store where file signals will be sent")
Reported-by: syzkaller-bugs@googlegroups.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fcntl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index a04accf6847f..4137d96534a6 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -791,7 +791,8 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
 	if (type <= PIDTYPE_TGID) {
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_PID);
-		send_sigio_to_task(p, fown, fd, band, type);
+		if (p)
+			send_sigio_to_task(p, fown, fd, band, type);
 		rcu_read_unlock();
 	} else {
 		read_lock(&tasklist_lock);
@@ -830,7 +831,8 @@ int send_sigurg(struct fown_struct *fown)
 	if (type <= PIDTYPE_TGID) {
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_PID);
-		send_sigurg_to_task(p, fown, type);
+		if (p)
+			send_sigurg_to_task(p, fown, type);
 		rcu_read_unlock();
 	} else {
 		read_lock(&tasklist_lock);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] signal: Don't send signals to tasks that don't exist
  2018-08-16  4:04                             ` [PATCH] signal: Don't send signals to tasks that don't exist Eric W. Biederman
@ 2018-08-17 17:34                               ` Dmitry Vyukov
  2018-08-17 18:46                                 ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Dmitry Vyukov @ 2018-08-17 17:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Wen Yang,
	majiang, J. Bruce Fields, syzkaller-bugs, syzbot

On Wed, Aug 15, 2018 at 9:04 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Recently syzbot reported crashes in send_sigio_to_task and
> send_sigurg_to_task in linux-next.  Despite finding a reproducer
> syzbot apparently did not bisected this or otherwise track down the
> offending commit in linux-next.
>
> I happened to see this report and examined the code because I had
> recently changed these functions as part of making PIDTYPE_TGID a real
> pid type so that fork would does not need to restart when receiving a
> signal.  By examination I see that I spotted a bug in the code
> that could explain the reported crashes.
>
> When I took Oleg's suggestion and optimized send_sigurg and send_sigio
> to only send to a single task when type is PIDTYPE_PID or PIDTYPE_TGID
> I failed to handle pids that no longer point to tasks.  The macro
> do_each_pid_task simply iterates for zero iterations.  With pid_task
> an explicit NULL test is needed.
>
> Update the code to include the missing NULL test.
>
> Fixes: 019191342fec ("signal: Use PIDTYPE_TGID to clearly store where file signals will be sent")
> Reported-by: syzkaller-bugs@googlegroups.com

Since the commit does not contain the syzbot-provided Reported-by tag,
we need to tell syzbot that this is fixed explicitly:

#syz fix: signal: Don't send signals to tasks that don't exist

> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/fcntl.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index a04accf6847f..4137d96534a6 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -791,7 +791,8 @@ void send_sigio(struct fown_struct *fown, int fd, int band)
>         if (type <= PIDTYPE_TGID) {
>                 rcu_read_lock();
>                 p = pid_task(pid, PIDTYPE_PID);
> -               send_sigio_to_task(p, fown, fd, band, type);
> +               if (p)
> +                       send_sigio_to_task(p, fown, fd, band, type);
>                 rcu_read_unlock();
>         } else {
>                 read_lock(&tasklist_lock);
> @@ -830,7 +831,8 @@ int send_sigurg(struct fown_struct *fown)
>         if (type <= PIDTYPE_TGID) {
>                 rcu_read_lock();
>                 p = pid_task(pid, PIDTYPE_PID);
> -               send_sigurg_to_task(p, fown, type);
> +               if (p)
> +                       send_sigurg_to_task(p, fown, type);
>                 rcu_read_unlock();
>         } else {
>                 read_lock(&tasklist_lock);
> --
> 2.17.1

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] signal: Don't send signals to tasks that don't exist
  2018-08-17 17:34                               ` Dmitry Vyukov
@ 2018-08-17 18:46                                 ` Eric W. Biederman
  2018-08-17 19:24                                   ` Andrew Morton
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-17 18:46 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Wen Yang,
	majiang, J. Bruce Fields, syzkaller-bugs, Andrey Vagin,
	Cyrill Gorcunov

Dmitry Vyukov <dvyukov@google.com> writes:

> On Wed, Aug 15, 2018 at 9:04 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Recently syzbot reported crashes in send_sigio_to_task and
>> send_sigurg_to_task in linux-next.  Despite finding a reproducer
>> syzbot apparently did not bisected this or otherwise track down the
>> offending commit in linux-next.
>>
>> I happened to see this report and examined the code because I had
>> recently changed these functions as part of making PIDTYPE_TGID a real
>> pid type so that fork would does not need to restart when receiving a
>> signal.  By examination I see that I spotted a bug in the code
>> that could explain the reported crashes.
>>
>> When I took Oleg's suggestion and optimized send_sigurg and send_sigio
>> to only send to a single task when type is PIDTYPE_PID or PIDTYPE_TGID
>> I failed to handle pids that no longer point to tasks.  The macro
>> do_each_pid_task simply iterates for zero iterations.  With pid_task
>> an explicit NULL test is needed.
>>
>> Update the code to include the missing NULL test.
>>
>> Fixes: 019191342fec ("signal: Use PIDTYPE_TGID to clearly store where file signals will be sent")
>> Reported-by: syzkaller-bugs@googlegroups.com
>
> Since the commit does not contain the syzbot-provided Reported-by tag,
> we need to tell syzbot that this is fixed explicitly:

Nor will my commits ever contain that information.  That is information
only of use to syzbot.  That is not information useful to anyone else.

Further syzbot did not track this down and report this.  Syzbot said
something is fishy here and happened to CC a public mailing list.  Only
by chance did I see the report.  There was enough information to start
an investigation but it certainly was not any kind of useful bug report.

It is very annoying that despite syzbot claiming to have a reproducer
syzbot completely failed to locate the problem commit or the proper
people to report the issue to.  I looked at the syzbot website link and
there was no evidence that syzbot even tried to track down which branch
in linux-next the commit came from.  Much less to identify the commit on
that branch.

Very annoyingly syzbot sent out emails and reported this before it found a
reproducer.  This is despite several people explicitly asking syzbot
to not report issues on linux-next where syzbot does not have a
reproducer and it can not track down the offending commit.

> #syz fix: signal: Don't send signals to tasks that don't exist

Private internal communication on a public mailing list is rude.  Please
cut it out.

I appreciate the eagerness to report bugs.  But to play well with others
and not waste developers' valuable time syzbot needs to track down the
offending commits and track fixes tags like everyone else.  Special
magic syzbot tags are annoying noise.

I will give credit where credit is due.  But syzbot is not so valuable
it can set rules for everyone else.  Automation is valuable when it
removes work.  Syzbot is not doing a good job at making the most of
developers' limited time.

Cyrill Gorcunov and Andrey Vagin did a much better job in tracking this
down and reporting this.  They just took a little bit longer.  Please
look at what they sent if you need an example of what a useful bug
report looks like.

Eric

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] signal: Don't send signals to tasks that don't exist
  2018-08-17 18:46                                 ` Eric W. Biederman
@ 2018-08-17 19:24                                   ` Andrew Morton
  2018-08-18  5:51                                     ` Eric W. Biederman
  0 siblings, 1 reply; 96+ messages in thread
From: Andrew Morton @ 2018-08-17 19:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dmitry Vyukov, Linus Torvalds, Oleg Nesterov, LKML, Wen Yang,
	majiang, J. Bruce Fields, syzkaller-bugs, Andrey Vagin,
	Cyrill Gorcunov

On Fri, 17 Aug 2018 13:46:36 -0500 ebiederm@xmission.com (Eric W. Biederman) wrote:

> Dmitry Vyukov <dvyukov@google.com> writes:
> 
> > On Wed, Aug 15, 2018 at 9:04 PM, Eric W. Biederman
> > <ebiederm@xmission.com> wrote:
> >>
> >> Recently syzbot reported crashes in send_sigio_to_task and
> >> send_sigurg_to_task in linux-next.  Despite finding a reproducer
> >> syzbot apparently did not bisected this or otherwise track down the
> >> offending commit in linux-next.
> >>
> >> I happened to see this report and examined the code because I had
> >> recently changed these functions as part of making PIDTYPE_TGID a real
> >> pid type so that fork would does not need to restart when receiving a
> >> signal.  By examination I see that I spotted a bug in the code
> >> that could explain the reported crashes.
> >>
> >> When I took Oleg's suggestion and optimized send_sigurg and send_sigio
> >> to only send to a single task when type is PIDTYPE_PID or PIDTYPE_TGID
> >> I failed to handle pids that no longer point to tasks.  The macro
> >> do_each_pid_task simply iterates for zero iterations.  With pid_task
> >> an explicit NULL test is needed.
> >>
> >> Update the code to include the missing NULL test.
> >>
> >> Fixes: 019191342fec ("signal: Use PIDTYPE_TGID to clearly store where file signals will be sent")
> >> Reported-by: syzkaller-bugs@googlegroups.com
> >
> > Since the commit does not contain the syzbot-provided Reported-by tag,
> > we need to tell syzbot that this is fixed explicitly:
> 
> Nor will my commits ever contain that information.  That is information
> only of use to syzbot.  That is not information useful to anyone else.
> 
> Further syzbot did not track this down and report this.  Syzbot said
> something is fishy here and happened to CC a public mailing list.  Only
> by chance did I see the report.  There was enough information to start
> an investigation but it certainly was not any kind of useful bug report.
> 
> It is very annoying that despite syzbot claming to have a reproducer
> syzbot completely failed to locate the problem commit or the proper
> people to repor the issue to.  I looked at the syzbot website link and
> there was no evidence that syzbot even tried to track down which branch
> in linux-next the commit came from.  Much less to identify the commit on
> that branch.

Dude, lighten up.

These reports are useful.  Even if they don't have a reproducer, we
have a backtrace and we can go look and we have a good chance of fixing
the bug.  And as adding a single-line tag to the commit message helps
the syzbot people keep track of things, why not do it?  It's hardly a big
effort.

They're doing useful things - please don't get all bent out of shape
because things aren't 100% perfect.

> Very annoyingly syzbot sent out emails and report this before it found a
> reproducer.  This is despite several people explicitly asking syzbot
> to not report issuing on linux-next where syzbot does not have a
> reproducer and it can not track down the offending commit.

Dear syzbot, please report linux-next issues when you do not have a
reproducer and/or cannot track down the offending commit.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] signal: Don't send signals to tasks that don't exist
  2018-08-17 19:24                                   ` Andrew Morton
@ 2018-08-18  5:51                                     ` Eric W. Biederman
  2018-08-20 19:26                                       ` J. Bruce Fields
  0 siblings, 1 reply; 96+ messages in thread
From: Eric W. Biederman @ 2018-08-18  5:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Vyukov, Linus Torvalds, Oleg Nesterov, LKML, Wen Yang,
	majiang, J. Bruce Fields, syzkaller-bugs, Andrey Vagin,
	Cyrill Gorcunov

Andrew Morton <akpm@linux-foundation.org> writes:

> Dude, lighten up.

This was in response to being asked by the maintainers of a bot that has
wasted copious quantities of my time to please not waste their time.

To prevent the wasting of time it was requested that when syzbot would
be enabled on linux-next again that it only report issues when there is
a reproducer, and it can narrow the issue down to at least the branch in
linux-next the issue occurred on.  I thought there was general agreement
on that point.

That very much did not happen in this case.  So I heard a request to not
waste the time of a bot from people who have not taken what appear to be
reasonable steps to not waste my time.

I am especially annoyed that the bot despite having a reproducer was not
able to at least narrow it down to the branch in linux-next.

I was dismayed when I saw the syzbot report triggered someone to remove
themselves from MAINTAINERS.

Further, I actually received a bug report (rather than finding one by luck
while skimming through a mailing list) from another group of people whose
automated testing also found the issue and who were able to successfully
root cause it.  So I know good bug reports are possible on this issue.

In short.  I have been burned by syzbot.  I don't see evidence of change.
I am grumpy.

Eric



* Re: [PATCH] signal: Don't send signals to tasks that don't exist
  2018-08-18  5:51                                     ` Eric W. Biederman
@ 2018-08-20 19:26                                       ` J. Bruce Fields
  0 siblings, 0 replies; 96+ messages in thread
From: J. Bruce Fields @ 2018-08-20 19:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Dmitry Vyukov, Linus Torvalds, Oleg Nesterov,
	LKML, Wen Yang, majiang, syzkaller-bugs, Andrey Vagin,
	Cyrill Gorcunov

On Sat, Aug 18, 2018 at 12:51:14AM -0500, Eric W. Biederman wrote:
> I was dismayed when I saw the syzbot report triggered someone to remove
> themselves from MAINTAINERS.

You're talking about my patch?  I think you misread it, I'm not removing
myself from MAINTAINERS.

--b.


end of thread, other threads:[~2018-08-20 19:26 UTC | newest]

Thread overview: 96+ messages
     [not found] <bug-200447-5873-b2kAsSyE1X@https.bugzilla.kernel.org/>
     [not found] ` <CA+55aFyOaEGc_wjRuAxYZH7D4zi4xUQTqUwMmUzxFJQ__2pXuQ@mail.gmail.com>
     [not found]   ` <87h8l9p7bg.fsf@xmission.com>
     [not found]     ` <20180709104158.GA23796@redhat.com>
     [not found]       ` <87sh4so5jv.fsf@xmission.com>
     [not found]         ` <20180709145726.GA26149@redhat.com>
     [not found]           ` <877em4nxo0.fsf@xmission.com>
     [not found]             ` <CA+55aFwq90_KeRUesktap7L_4Hp3gatKZ28RYw1jXBYeOqUoeA@mail.gmail.com>
     [not found]               ` <87lgakm4ol.fsf@xmission.com>
     [not found]                 ` <CA+55aFz1XFvOgJySKVNQD9VS4hys0J7rozxqd3s5ND0z80tfVw@mail.gmail.com>
     [not found]                   ` <20180710134639.GA2453@redhat.com>
2018-07-10 16:00                     ` [Bug 200447] infinite loop in fork syscall Eric W. Biederman
2018-07-11 12:08                       ` Oleg Nesterov
     [not found]                     ` <CA+55aFxcjSYAj-CZFEuDwiwZyMg+prV_jeP_Vuh06RJA0BboSw@mail.gmail.com>
2018-07-11  2:41                       ` [RFC][PATCH 0/11] PIDTYPE_TGID and fewer fork restarts Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 01/11] pids: Initialize leader_pid in init_task Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 02/11] pids: Move task_pid_type into sched/signal.h Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 03/11] pids: Compute task_tgid using signal->leader_pid Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 04/11] kvm: Don't open code task_pid in kvm_vcpu_ioctl Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 05/11] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
2018-07-17 11:59                           ` Oleg Nesterov
2018-07-11  2:44                         ` [RFC][PATCH 06/11] pid: Implement PIDTYPE_TGID Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 07/11] signal: Deliver group signals via PIDTYPE_TGID not PIDTYPE_PID Eric W. Biederman
2018-07-16 12:51                           ` Oleg Nesterov
2018-07-16 14:50                             ` Eric W. Biederman
2018-07-16 17:17                               ` Linus Torvalds
2018-07-16 18:01                                 ` Eric W. Biederman
2018-07-16 18:40                                   ` Linus Torvalds
2018-07-17  9:56                                   ` Oleg Nesterov
2018-07-17 10:18                                     ` Oleg Nesterov
2018-07-20 23:41                                       ` Eric W. Biederman
2018-07-17 16:38                               ` Linus Torvalds
2018-07-20 23:27                                 ` Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 08/11] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
2018-07-11  2:44                         ` [RFC][PATCH 09/11] tty_io: Use do_send_sig_info in __do_SACK to forcibly kill tasks Eric W. Biederman
2018-07-16 14:55                           ` Oleg Nesterov
2018-07-16 15:08                             ` Eric W. Biederman
2018-07-16 16:50                               ` Linus Torvalds
2018-07-16 19:17                                 ` Eric W. Biederman
2018-07-16 19:36                                   ` Linus Torvalds
2018-07-17  1:48                                     ` Eric W. Biederman
2018-07-17 10:58                               ` Oleg Nesterov
2018-07-11  2:44                         ` [RFC][PATCH 10/11] signal: Push pid type from signal senders down into __send_signal Eric W. Biederman
2018-07-11  3:11                           ` Linus Torvalds
2018-07-11  2:44                         ` [RFC][PATCH 11/11] signal: Ignore all but multi-process signals that come in during fork Eric W. Biederman
2018-07-11 14:14                           ` Oleg Nesterov
2018-07-11 16:02                             ` Eric W. Biederman
2018-07-12 13:42                               ` Oleg Nesterov
2018-07-12 17:11                                 ` Eric W. Biederman
2018-07-13 14:51                             ` Eric W. Biederman
2018-07-24  3:22                         ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 01/20] pids: Initialize leader_pid in init_task Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 02/20] pids: Move task_pid_type into sched/signal.h Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 03/20] pids: Compute task_tgid using signal->leader_pid Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 04/20] kvm: Don't open code task_pid in kvm_vcpu_ioctl Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 05/20] pids: Move the pgrp and session pid pointers from task_struct to signal_struct Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 06/20] pid: Implement PIDTYPE_TGID Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 07/20] signal: Use PIDTYPE_TGID to clearly store where file signals will be sent Eric W. Biederman
2018-08-16  4:04                             ` [PATCH] signal: Don't send signals to tasks that don't exist Eric W. Biederman
2018-08-17 17:34                               ` Dmitry Vyukov
2018-08-17 18:46                                 ` Eric W. Biederman
2018-08-17 19:24                                   ` Andrew Morton
2018-08-18  5:51                                     ` Eric W. Biederman
2018-08-20 19:26                                       ` J. Bruce Fields
2018-07-24  3:24                           ` [PATCH 08/20] posix-timers: Noralize good_sigevent Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 09/20] signal: Pass pid and pid type into send_sigqueue Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 10/20] signal: Pass pid type into group_send_sig_info Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 11/20] signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 12/20] signal: Pass pid type into do_send_sig_info Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 13/20] signal: Push pid type down into send_signal Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 14/20] signal: Push pid type down into __send_signal Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 15/20] signal: Push pid type down into complete_signal Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 16/20] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 17/20] fork: Unconditionally exit if a fatal signal is pending Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 18/20] signal: Add calculate_sigpending() Eric W. Biederman
2018-07-26 13:20                             ` Oleg Nesterov
2018-07-26 15:13                               ` Eric W. Biederman
2018-07-26 16:24                                 ` Oleg Nesterov
2018-07-24  3:24                           ` [PATCH 19/20] fork: Have new threads join on-going signal group stops Eric W. Biederman
2018-07-24  3:24                           ` [PATCH 20/20] signal: Don't restart fork when signals come in Eric W. Biederman
2018-07-24 17:27                             ` Linus Torvalds
2018-07-24 17:58                               ` Eric W. Biederman
2018-07-24 18:29                                 ` Linus Torvalds
2018-07-24 20:05                                   ` Eric W. Biederman
2018-07-24 20:15                                     ` Linus Torvalds
2018-07-24 20:40                                       ` [PATCH v2 " Eric W. Biederman
2018-07-24 20:56                                         ` Linus Torvalds
2018-07-24 21:24                                           ` Eric W. Biederman
2018-07-25  3:56                                             ` [PATCH v3 " Eric W. Biederman
2018-07-26 13:41                                               ` Oleg Nesterov
2018-07-26 14:42                                                 ` Eric W. Biederman
2018-07-26 15:55                                                   ` Oleg Nesterov
2018-08-09  6:19                                                     ` Eric W. Biederman
2018-07-26 16:21                                               ` Oleg Nesterov
2018-07-24 17:29                           ` [PATCH 00/20] PIDTYPE_TGID removal of fork restarts Linus Torvalds
2018-07-25 16:05                           ` Oleg Nesterov
2018-07-25 16:58                             ` Eric W. Biederman
2018-08-09  6:53                           ` [PATCH v5 0/6] Not restarting for due to signals Eric W. Biederman
2018-08-09  6:56                             ` [PATCH v5 1/6] fork: Move and describe why the code examines PIDNS_ADDING Eric W. Biederman
2018-08-09  6:56                             ` [PATCH v5 2/6] fork: Unconditionally exit if a fatal signal is pending Eric W. Biederman
2018-08-09  6:56                             ` [PATCH v5 3/6] signal: Add calculate_sigpending() Eric W. Biederman
2018-08-09  6:56                             ` [PATCH v5 4/6] fork: Skip setting TIF_SIGPENDING in ptrace_init_task Eric W. Biederman
2018-08-09  6:56                             ` [PATCH v5 5/6] fork: Have new threads join on-going signal group stops Eric W. Biederman
2018-08-09  6:56                             ` [PATCH v5 6/6] signal: Don't restart fork when signals come in Eric W. Biederman
2018-08-09 17:15                               ` Linus Torvalds
2018-08-09 17:42                                 ` Eric W. Biederman
2018-08-09 18:09                                 ` [PATCH v6 " Eric W. Biederman
2018-08-09 17:16                             ` [PATCH v5 0/6] Not restarting for due to signals Linus Torvalds
