[PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist_lock
@ 2018-01-11 15:49 Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 1/4] exec: Pass unshared files_struct to load_binary() Kirill Tkhai
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-11 15:49 UTC (permalink / raw)
  To: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, oleg, mingo, akpm, mhocko,
	peterz, ktkhai

Hi,

this patchset makes __do_SAK() to take tasklist_lock for very small time
in comparison to that it does now. Though this function is executed
in process context and it takes tasklist_lock read locked with interrupts enabled,
another tasks may want to take it for writing with interrupt disabled
(e.g., forking tasks), and these tasks may evoke hard lockups.

I've observed several hard lockups caused by long execution of __do_SAK()
on the node with 200 big containers. 3.10 kernel is used there, and mainline
kernel does not have differences in comparation to that, because of __do_SAK()
function has not changed for a long time. So, mainline kernel has this problem too.

The patchset proposes two optimizations in __do__SAK(). The first one is
to skip threads, when process's open files are being analyzed (see [3/4]).
We have to check thread's files only to close the race with unshare_files()
and failing exec. This can be fixed via small modifications of unshare_files().
See [1-2/4] for the details.

The second optimization is to iterate task list under rcu_read_lock().
This allows to take tasklist_lock for a very small time just to check we
reached the end of the task list. See patch [4/4] for the details.

Thanks,
Kirill

---

Kirill Tkhai (4):
      exec: Pass unshared files_struct to load_binary()
      exec: Assign unshared files after there is no way back
      tty: Iterate only thread group leaders in __do_SAK()
      tty: Use RCU read lock to iterate tasks in __do_SAK()

 drivers/tty/tty_io.c    |   34 +++++++++++++++++++++++++---------
 fs/binfmt_misc.c        |    8 ++++----
 fs/coredump.c           |    9 +++++----
 fs/exec.c               |   14 ++++++++------
 fs/file.c               |    4 +++-
 include/linux/binfmts.h |    1 +
 include/linux/fdtable.h |    4 ++--
 kernel/fork.c           |   19 +++++++------------
 8 files changed, 55 insertions(+), 38 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] exec: Pass unshared files_struct to load_binary()
  2018-01-11 15:49 [PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist_lock Kirill Tkhai
@ 2018-01-11 15:50 ` Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 2/4] exec: Assign unshared files after there is no way back Kirill Tkhai
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-11 15:50 UTC (permalink / raw)
  To: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, oleg, mingo, akpm, mhocko,
	peterz, ktkhai

This patch adds pointer to new unshared files_struct to struct linux_binprm
and makes load_misc_binary() method to use __alloc_fd() and __fd_install(),
instead of alloc_fd() and fd_install(), which are related to current task.
It's a preparation for next patch, that stops assigning of current->files
before load_binary call. Since bprm->new_files == current->files now,
this patch is just a preparation, and it don't change functionality.
The only difference is export of two functions, that I pointed above.

load_misc_binary() is the only load_binary method modifying fd table,
so other methods are not modified.

Note, that I don't replace sys_close(fd_binary) with __close_fd()
in load_misc_binary(), because of next patch will kill this close
completely, while __close_fd() requires to be exported too. It's
pointless to export it and then to unexport it, so we skip this.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/binfmt_misc.c        |    6 ++++--
 fs/exec.c               |    1 +
 fs/file.c               |    2 ++
 include/linux/binfmts.h |    1 +
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index a7c5a9861bef..13c1fae45717 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -19,6 +19,7 @@
 #include <linux/ctype.h>
 #include <linux/string_helpers.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/pagemap.h>
 #include <linux/namei.h>
 #include <linux/mount.h>
@@ -164,12 +165,13 @@ static int load_misc_binary(struct linux_binprm *bprm)
 		 * interpreter than keep it open and assign descriptor
 		 * to it
 		 */
-		fd_binary = get_unused_fd_flags(0);
+		fd_binary = __alloc_fd(bprm->new_files, 0,
+				       rlimit(RLIMIT_NOFILE), 0);
 		if (fd_binary < 0) {
 			retval = fd_binary;
 			goto ret;
 		}
-		fd_install(fd_binary, bprm->file);
+		__fd_install(bprm->new_files, fd_binary, bprm->file);
 
 		/* if the binary is not readable than enforce mm->dumpable=0
 		   regardless of the interpreter's permissions */
diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21bcab9..4c3b924ae103 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1729,6 +1729,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 	if (!bprm)
 		goto out_files;
+	bprm->new_files = current->files;
 
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
diff --git a/fs/file.c b/fs/file.c
index fc0eeb812e2c..e578e5537c32 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -538,6 +538,7 @@ int __alloc_fd(struct files_struct *files,
 	spin_unlock(&files->file_lock);
 	return error;
 }
+EXPORT_SYMBOL_GPL(__alloc_fd);
 
 static int alloc_fd(unsigned start, unsigned flags)
 {
@@ -611,6 +612,7 @@ void __fd_install(struct files_struct *files, unsigned int fd,
 	rcu_assign_pointer(fdt->fd[fd], file);
 	rcu_read_unlock_sched();
 }
+EXPORT_SYMBOL_GPL(__fd_install);
 
 void fd_install(unsigned int fd, struct file *file)
 {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b0abe21d6cc9..1470be496ad2 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -24,6 +24,7 @@ struct linux_binprm {
 	struct page *page[MAX_ARG_PAGES];
 #endif
 	struct mm_struct *mm;
+	struct files_struct *new_files;
 	unsigned long p; /* current top of mem */
 	unsigned int
 		/*

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] exec: Assign unshared files after there is no way back
  2018-01-11 15:49 [PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist_lock Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 1/4] exec: Pass unshared files_struct to load_binary() Kirill Tkhai
@ 2018-01-11 15:50 ` Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK() Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 4/4] tty: Use RCU read lock to iterate tasks " Kirill Tkhai
  3 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-11 15:50 UTC (permalink / raw)
  To: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, oleg, mingo, akpm, mhocko,
	peterz, ktkhai

The patch makes unshared files_struct to be assigned
after exec or coredump are known to be successeful.
Since previous patch converted all load_binary methods
to use linux_binprm::new_files instead of current->files,
now we are able to assign unshared_files after load_binary
call.

This change makes task->files more predictable and since
that we may analyse the only thread's files pointer
to determ either the task contains a fd or not, and not
be afraid of race with failing exec. Next patch will
use this feature to skip thread group iteration in __do_SAK(),
and also this predictability may be useful for something else.

Also note, that without the patch we skip failing exec,
when we are doing the iterations in __do_SAK(), and
these tasks remain alive. The patch also fixes this rare issue.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/binfmt_misc.c        |    2 --
 fs/coredump.c           |    9 +++++----
 fs/exec.c               |   15 ++++++++-------
 fs/file.c               |    2 +-
 include/linux/fdtable.h |    4 ++--
 kernel/fork.c           |   19 +++++++------------
 6 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 13c1fae45717..7568456895c7 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -242,8 +242,6 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	dput(fmt->dentry);
 	return retval;
 error:
-	if (fd_binary > 0)
-		sys_close(fd_binary);
 	bprm->interp_flags = 0;
 	bprm->interp_data = 0;
 	goto ret;
diff --git a/fs/coredump.c b/fs/coredump.c
index 1e2c87acac9b..fe5a4504d975 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -546,7 +546,7 @@ void do_coredump(const siginfo_t *siginfo)
 	struct cred *cred;
 	int retval = 0;
 	int ispipe;
-	struct files_struct *displaced;
+	struct files_struct *files;
 	/* require nonrelative corefile path and be extra careful */
 	bool need_suid_safe = false;
 	bool core_dumped = false;
@@ -747,11 +747,12 @@ void do_coredump(const siginfo_t *siginfo)
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */
-	retval = unshare_files(&displaced);
+	files = unshare_files();
+	retval = PTR_ERR_OR_ZERO(files);
 	if (retval)
 		goto close_fail;
-	if (displaced)
-		put_files_struct(displaced);
+	if (files != current->files)
+		assign_files_struct(files);
 	if (!dump_interrupted()) {
 		file_start_write(cprm.file);
 		core_dumped = binfmt->core_dump(&cprm);
diff --git a/fs/exec.c b/fs/exec.c
index 4c3b924ae103..ced4dc09444a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1699,7 +1699,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
 	struct file *file;
-	struct files_struct *displaced;
+	struct files_struct *files;
 	int retval;
 
 	if (IS_ERR(filename))
@@ -1721,7 +1721,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 * further execve() calls fail. */
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	retval = unshare_files(&displaced);
+	files = unshare_files();
+	retval = PTR_ERR_OR_ZERO(files);
 	if (retval)
 		goto out_ret;
 
@@ -1729,7 +1730,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 	if (!bprm)
 		goto out_files;
-	bprm->new_files = current->files;
+	bprm->new_files = files;
 
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
@@ -1813,8 +1814,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	free_bprm(bprm);
 	kfree(pathbuf);
 	putname(filename);
-	if (displaced)
-		put_files_struct(displaced);
+	if (files != current->files)
+		assign_files_struct(files);
 	return retval;
 
 out:
@@ -1832,8 +1833,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	kfree(pathbuf);
 
 out_files:
-	if (displaced)
-		reset_files_struct(displaced);
+	if (files != current->files)
+		put_files_struct(files);
 out_ret:
 	putname(filename);
 	return retval;
diff --git a/fs/file.c b/fs/file.c
index e578e5537c32..7ed519c65d0b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -427,7 +427,7 @@ void put_files_struct(struct files_struct *files)
 	}
 }
 
-void reset_files_struct(struct files_struct *files)
+void assign_files_struct(struct files_struct *files)
 {
 	struct task_struct *tsk = current;
 	struct files_struct *old;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..00db7eadf895 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -104,8 +104,8 @@ struct task_struct;
 
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
-void reset_files_struct(struct files_struct *);
-int unshare_files(struct files_struct **);
+void assign_files_struct(struct files_struct *);
+struct files_struct *unshare_files(void);
 struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
diff --git a/kernel/fork.c b/kernel/fork.c
index 2295fc69717f..7690734eb354 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2434,22 +2434,17 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
  *	the exec layer of the kernel.
  */
 
-int unshare_files(struct files_struct **displaced)
+struct files_struct *unshare_files(void)
 {
 	struct task_struct *task = current;
-	struct files_struct *copy = NULL;
+	struct files_struct *files = task->files;
 	int error;
 
-	error = unshare_fd(CLONE_FILES, &copy);
-	if (error || !copy) {
-		*displaced = NULL;
-		return error;
-	}
-	*displaced = task->files;
-	task_lock(task);
-	task->files = copy;
-	task_unlock(task);
-	return 0;
+	error = unshare_fd(CLONE_FILES, &files);
+	if (error)
+		return ERR_PTR(error);
+
+	return files;
 }
 
 int sysctl_max_threads(struct ctl_table *table, int write,

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-11 15:49 [PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist_lock Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 1/4] exec: Pass unshared files_struct to load_binary() Kirill Tkhai
  2018-01-11 15:50 ` [PATCH 2/4] exec: Assign unshared files after there is no way back Kirill Tkhai
@ 2018-01-11 15:50 ` Kirill Tkhai
  2018-01-11 18:34   ` Oleg Nesterov
  2018-01-11 15:50 ` [PATCH 4/4] tty: Use RCU read lock to iterate tasks " Kirill Tkhai
  3 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-11 15:50 UTC (permalink / raw)
  To: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, oleg, mingo, akpm, mhocko,
	peterz, ktkhai

Since threads can't have additional fd in comparison
to thread group leader (previous patch closed races
with failing exec), we may skip useless iterations
over threads files, as they definitely have the same
files struct, as thread group leader.

So, skip the pointless iterations and make __do_SAK()
faster.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 drivers/tty/tty_io.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index dc60aeea87d8..94813ae40983 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2704,7 +2704,7 @@ void __do_SAK(struct tty_struct *tty)
 #ifdef TTY_SOFT_SAK
 	tty_hangup(tty);
 #else
-	struct task_struct *g, *p;
+	struct task_struct *p;
 	struct pid *session;
 	int		i;
 
@@ -2725,7 +2725,7 @@ void __do_SAK(struct tty_struct *tty)
 	} while_each_pid_task(session, PIDTYPE_SID, p);
 
 	/* Now kill any processes that happen to have the tty open */
-	do_each_thread(g, p) {
+	for_each_process(p) {
 		if (p->signal->tty == tty) {
 			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
 				   task_pid_nr(p), p->comm);
@@ -2740,7 +2740,7 @@ void __do_SAK(struct tty_struct *tty)
 			force_sig(SIGKILL, p);
 		}
 		task_unlock(p);
-	} while_each_thread(g, p);
+	}
 	read_unlock(&tasklist_lock);
 #endif
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] tty: Use RCU read lock to iterate tasks in __do_SAK()
  2018-01-11 15:49 [PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist_lock Kirill Tkhai
                   ` (2 preceding siblings ...)
  2018-01-11 15:50 ` [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK() Kirill Tkhai
@ 2018-01-11 15:50 ` Kirill Tkhai
  3 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-11 15:50 UTC (permalink / raw)
  To: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, oleg, mingo, akpm, mhocko,
	peterz, ktkhai

There were made several efforts to make __do_SAK()
working in process context long ago, but it does
not solves the problem completely. Since __do_SAK()
may take tasklist_lock for a long time, the concurent
processes waiting for write lock with interrupts
disabled (e.g., forking), get into the same situation
like __do_SAK() would have been executed in interrupt
context. I've observed several hard lockups on 3.10
kernel running 200 containers, caused by long duration
of copy_process()->write_lock_irq() after SAK was sent
to a tty. Current mainline kernel has the same problem.
So, this patch solves the problem and makes __do_SAK()
to be not greedy of tasklist_lock.

The solution is to use RCU to iterate processes and files.
Task list integrity is the only reason we taken tasklist_lock
before, as tty subsys primitives mostly take it for reading
also (e.g., __proc_set_tty). RCU read lock is enough for that.
Task list lock is taken for a small duration to check we've
iterated all the list and do not race with newly forked children.
It's not really useful, but we iterated the whole list before,
so let's save the old behaviour.

This patch covers all the cases, when __do_SAK() used
to find a tty-related task before. Also, I reordered
file iterations and p->signal->tty check and added
tty_lock() to close small race with the case, when
task obtains tty fd and then calls __proc_set_tty()
in parallel with its processing.

The patch is aimed to prevent hard lockups I've pointed above.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 drivers/tty/tty_io.c |   28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 94813ae40983..2b6afe9e7bbb 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2707,6 +2707,7 @@ void __do_SAK(struct tty_struct *tty)
 	struct task_struct *p;
 	struct pid *session;
 	int		i;
+	bool locked = false;

 	if (!tty)
 		return;
@@ -2723,15 +2724,12 @@ void __do_SAK(struct tty_struct *tty)
 			   task_pid_nr(p), p->comm);
 		send_sig(SIGKILL, p, 1);
 	} while_each_pid_task(session, PIDTYPE_SID, p);
+	read_unlock(&tasklist_lock);

+	tty_lock(tty);
+	rcu_read_lock();
 	/* Now kill any processes that happen to have the tty open */
 	for_each_process(p) {
-		if (p->signal->tty == tty) {
-			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
-				   task_pid_nr(p), p->comm);
-			send_sig(SIGKILL, p, 1);
-			continue;
-		}
 		task_lock(p);
 		i = iterate_fd(p->files, 0, this_tty, tty);
 		if (i != 0) {
@@ -2740,8 +2738,26 @@ void __do_SAK(struct tty_struct *tty)
 			force_sig(SIGKILL, p);
 		}
 		task_unlock(p);
+
+		/*
+		 * p->signal is always valid for task_struct obtained
+		 * from the task list under rcu_read_lock().
+		 */
+		if (!i && p->signal->tty == tty) {
+			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
+				   task_pid_nr(p), p->comm);
+			send_sig(SIGKILL, p, 1);
+		}
+
+		if (unlikely(next_task(p) == &init_task && !locked)) {
+			/* Take the lock to pick newly forked tasks */
+			read_lock(&tasklist_lock);
+			locked = true;
+		}
 	}
 	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
+	tty_unlock(tty);
 #endif
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-11 15:50 ` [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK() Kirill Tkhai
@ 2018-01-11 18:34   ` Oleg Nesterov
  2018-01-12  8:42     ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2018-01-11 18:34 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 01/11, Kirill Tkhai wrote:
>
> Since threads can't have additional fd in comparison
> to thread group leader
...
> as they definitely have the same
> files struct, as thread group leader.

Hmm. Why? Iirc CLONE_THREAD doesn't require CLONE_FILES?

Also. The group leader can exit, in this case its ->files == NULL
but other threads can be alive.

Oleg.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-11 18:34   ` Oleg Nesterov
@ 2018-01-12  8:42     ` Kirill Tkhai
  2018-01-12 10:05       ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-12  8:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 11.01.2018 21:34, Oleg Nesterov wrote:
> On 01/11, Kirill Tkhai wrote:
>>
>> Since threads can't have additional fd in comparison
>> to thread group leader
> ...
>> as they definitely have the same
>> files struct, as thread group leader.
> 
> Hmm. Why? Iirc CLONE_THREAD doesn't require CLONE_FILES?

Oh, it's really so. Surprise. Thanks for pointing that.
I'll try to find a way, how we can iterate threads fds using rcu.

> Also. The group leader can exit, in this case its ->files == NULL
> but other threads can be alive.

Sure, thanks, Oleg.

Kirill

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-12  8:42     ` Kirill Tkhai
@ 2018-01-12 10:05       ` Kirill Tkhai
  2018-01-12 16:42         ` Oleg Nesterov
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-12 10:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 12.01.2018 11:42, Kirill Tkhai wrote:
> On 11.01.2018 21:34, Oleg Nesterov wrote:
>> On 01/11, Kirill Tkhai wrote:
>>>
>>> Since threads can't have additional fd in comparison
>>> to thread group leader
>> ...
>>> as they definitely have the same
>>> files struct, as thread group leader.
>>
>> Hmm. Why? Iirc CLONE_THREAD doesn't require CLONE_FILES?
> 
> Oh, it's really so. Surprise. Thanks for pointing that.
> I'll try to find a way, how we can iterate threads fds using rcu.
> 
>> Also. The group leader can exit, in this case its ->files == NULL
>> but other threads can be alive.
> 
> Sure, thanks, Oleg.

How about this patch instead of the whole set? I left thread iterations
and added sighand locking for visability.

It looks like the only way, that already iterated tasks reopen tty fd again,
is when they obtain it from unix scm or from foreign /proc/[pid]/fd/[fd]
like it was before the patch. What do you think about this?

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index dc60aeea87d8..ab86aabfebc7 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2706,6 +2706,7 @@ void __do_SAK(struct tty_struct *tty)
 #else
 	struct task_struct *g, *p;
 	struct pid *session;
+	unsigned long flags;
 	int		i;
 
 	if (!tty)
@@ -2723,25 +2724,51 @@ void __do_SAK(struct tty_struct *tty)
 			   task_pid_nr(p), p->comm);
 		send_sig(SIGKILL, p, 1);
 	} while_each_pid_task(session, PIDTYPE_SID, p);
+	read_unlock(&tasklist_lock);
 
+	tty_lock(tty);
+	rcu_read_lock();
 	/* Now kill any processes that happen to have the tty open */
-	do_each_thread(g, p) {
-		if (p->signal->tty == tty) {
-			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
-				   task_pid_nr(p), p->comm);
-			send_sig(SIGKILL, p, 1);
-			continue;
+	for_each_process(g) {
+		for_each_thread(g, p) {
+			task_lock(p);
+			i = iterate_fd(p->files, 0, this_tty, tty);
+			if (i != 0) {
+				tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
+					   task_pid_nr(p), p->comm, i - 1);
+				force_sig(SIGKILL, p);
+			}
+			task_unlock(p);
+
+			/*
+			 * p->signal is always valid for task_struct obtained
+			 * from the task list under rcu_read_lock().
+			 */
+			if (!i && p->signal->tty == tty) {
+				tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
+					   task_pid_nr(p), p->comm);
+				send_sig(SIGKILL, p, 1);
+			}
+
+			if (READ_ONCE(p->thread_node.next) == &g->signal->thread_head) {
+				/* Take and drop the lock to see newly forked threads */
+				if (lock_task_sighand(p, &flags))
+					unlock_task_sighand(p, &flags);
+				else {
+					read_lock(&tasklist_lock);
+					read_unlock(&tasklist_lock);
+				}
+			}
 		}
-		task_lock(p);
-		i = iterate_fd(p->files, 0, this_tty, tty);
-		if (i != 0) {
-			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
-				   task_pid_nr(p), p->comm, i - 1);
-			force_sig(SIGKILL, p);
+
+		if (unlikely(next_task(p) == &init_task)) {
+			/* Take and drop the lock to see newly forked tasks */
+			read_lock(&tasklist_lock);
+			read_unlock(&tasklist_lock);
 		}
-		task_unlock(p);
-	} while_each_thread(g, p);
-	read_unlock(&tasklist_lock);
+	}
+	rcu_read_unlock();
+	tty_unlock(tty);
 #endif
 }
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-12 10:05       ` Kirill Tkhai
@ 2018-01-12 16:42         ` Oleg Nesterov
  2018-01-15  9:32           ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2018-01-12 16:42 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 01/12, Kirill Tkhai wrote:
>
> How about this patch instead of the whole set? I left thread iterations
> and added sighand locking for visability.

Kirill, I didn't really read this series so I don't quite understand what
are you actually trying to do...

__do_SAK() is racy anyway, a process can open tty right after it was checked,
and I do not understand why should we care about races with execve.

IOW, I do not understand why we can't simply use rcu_read_lock() after
do_each_pid_task/while_each_pid_task. Yes we can miss the new process/thread,
but if the creator process had this tty opened it should be killed by us so
fork/clone can't succeed: both do_fork() and send_sig() take the same lock
and do_fork() checks signal_pending() under ->siglock.

No?

And whatever we do, I think you are right and for_each_process() makes more
sense, and in the likely case all sub-threads should share the same file_struct.
So perhaps we should start with the simple cleanup? Say,

	for_each_process(p) {
		if (p->signal->tty == tty) {
			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
				   task_pid_nr(p), p->comm);
			goto kill;
		}

		files = NULL;
		for_each_thread(p, t) {
				if (t->files == files) /* racy but we do not care */
					continue;

				task_lock(t);
				files = t->files;
				i = iterate_fd(files, 0, this_tty, tty);
				task_unlock(t);

				if (i != 0) {
					tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
						   task_pid_nr(p), p->comm, i - 1);
					goto kill;
				}
		}

		continue;
kill:
		force_sig(SIGKILL, p);
	}

(see the uncompiled/untested patch below), then make another change to avoid
tasklist_lock?

Oleg.


--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2704,7 +2704,8 @@ void __do_SAK(struct tty_struct *tty)
 #ifdef TTY_SOFT_SAK
 	tty_hangup(tty);
 #else
-	struct task_struct *g, *p;
+	struct task_struct *p, *t;
+	struct files_struct files;
 	struct pid *session;
 	int		i;
 
@@ -2725,22 +2726,34 @@ void __do_SAK(struct tty_struct *tty)
 	} while_each_pid_task(session, PIDTYPE_SID, p);
 
 	/* Now kill any processes that happen to have the tty open */
-	do_each_thread(g, p) {
+	for_each_process(p) {
 		if (p->signal->tty == tty) {
 			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
 				   task_pid_nr(p), p->comm);
-			send_sig(SIGKILL, p, 1);
-			continue;
+			goto kill;
 		}
-		task_lock(p);
-		i = iterate_fd(p->files, 0, this_tty, tty);
-		if (i != 0) {
-			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
-				   task_pid_nr(p), p->comm, i - 1);
-			force_sig(SIGKILL, p);
+
+		files = NULL;
+		for_each_thread(p, t) {
+				if (t->files == files) /* racy but we do not care */
+					continue;
+
+				task_lock(t);
+				files = t->files;
+				i = iterate_fd(files, 0, this_tty, tty);
+				task_unlock(t);
+
+				if (i != 0) {
+					tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
+						   task_pid_nr(p), p->comm, i - 1);
+					goto kill;
+				}
 		}
-		task_unlock(p);
-	} while_each_thread(g, p);
+
+		continue;
+kill:
+		force_sig(SIGKILL, p);
+	}
 	read_unlock(&tasklist_lock);
 #endif
 }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-12 16:42         ` Oleg Nesterov
@ 2018-01-15  9:32           ` Kirill Tkhai
  2018-01-15 20:51             ` Oleg Nesterov
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-15  9:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 12.01.2018 19:42, Oleg Nesterov wrote:
> On 01/12, Kirill Tkhai wrote:
>>
>> How about this patch instead of the whole set? I left thread iterations
>> and added sighand locking for visability.
> 
> Kirill, I didn't really read this series so I don't quite understand what
> are you actually trying to do...
> 
> __do_SAK() is racy anyway, a process can open tty right after it was checked,
> and I do not understand why should we care about races with execve.

Please, just ignore two first patches. As I wrote I thought we iterate threads
to close race with exec (and missed that threads may have unshared fd table).
So, if we didn't use to care about such situations, I don't care them now. My main
target is just to speed up __do_SAK().
 
> IOW, I do not understand why we can't simply use rcu_read_lock() after
> do_each_pid_task/while_each_pid_task. Yes we can miss the new process/thread,
> but if the creator process had this tty opened it should be killed by us so
> fork/clone can't succeed: both do_fork() and send_sig() take the same lock
> and do_fork() checks signal_pending() under ->siglock.
> 
> No?

Yes, but we send signal not every time. So, this was the only reason I added
lock/unlock the locks. But anyway, __do_SAK() is racy and the effect of that
is minimal, so it seems we may skip this.

> And whatever we do, I think you are right and for_each_process() makes more
> sense, and in the likely case all sub-threads should share the same file_struct.
> So perhaps we should start with the simple cleanup? Say,
> 
> 	for_each_process(p) {
> 		if (p->signal->tty == tty) {
> 			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
> 				   task_pid_nr(p), p->comm);
> 			goto kill;
> 		}
> 
> 		files = NULL;
> 		for_each_thread(p, t) {
> 				if (t->files == files) /* racy but we do not care */
> 					continue;
> 
> 				task_lock(t);
> 				files = t->files;
> 				i = iterate_fd(files, 0, this_tty, tty);
> 				task_unlock(t);
> 
> 				if (i != 0) {
> 					tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
> 						   task_pid_nr(p), p->comm, i - 1);
> 					goto kill;
> 				}
> 		}
> 
> 		continue;
> kill:
> 		force_sig(SIGKILL, p);
> 	}
> 
> (see the uncompiled/untested patch below), then make another change to avoid
> tasklist_lock?
> 
> 
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -2704,7 +2704,8 @@ void __do_SAK(struct tty_struct *tty)
>  #ifdef TTY_SOFT_SAK
>  	tty_hangup(tty);
>  #else
> -	struct task_struct *g, *p;
> +	struct task_struct *p, *t;
> +	struct files_struct files;
>  	struct pid *session;
>  	int		i;
>  
> @@ -2725,22 +2726,34 @@ void __do_SAK(struct tty_struct *tty)
>  	} while_each_pid_task(session, PIDTYPE_SID, p);
>  
>  	/* Now kill any processes that happen to have the tty open */
> -	do_each_thread(g, p) {
> +	for_each_process(p) {
>  		if (p->signal->tty == tty) {
>  			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
>  				   task_pid_nr(p), p->comm);
> -			send_sig(SIGKILL, p, 1);
> -			continue;
> +			goto kill;
>  		}
> -		task_lock(p);
> -		i = iterate_fd(p->files, 0, this_tty, tty);
> -		if (i != 0) {
> -			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
> -				   task_pid_nr(p), p->comm, i - 1);
> -			force_sig(SIGKILL, p);
> +
> +		files = NULL;
> +		for_each_thread(p, t) {
> +				if (t->files == files) /* racy but we do not care */
> +					continue;
> +
> +				task_lock(t);
> +				files = t->files;
> +				i = iterate_fd(files, 0, this_tty, tty);
> +				task_unlock(t);
> +
> +				if (i != 0) {
> +					tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
> +						   task_pid_nr(p), p->comm, i - 1);
> +					goto kill;
> +				}
>  		}
> -		task_unlock(p);
> -	} while_each_thread(g, p);
> +
> +		continue;
> +kill:
> +		force_sig(SIGKILL, p);
> +	}
>  	read_unlock(&tasklist_lock);
>  #endif
>  }

I tested your patch with small modification in "struct files_struct *files;" ('*' is added).
Could I send it with your "Signed-off-by" as the second version?

Also, the below patch will go on top of yours:

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 979eb5d80fe9..20b74b4c9f84 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2724,7 +2724,9 @@ void __do_SAK(struct tty_struct *tty)
 			   task_pid_nr(p), p->comm);
 		send_sig(SIGKILL, p, 1);
 	} while_each_pid_task(session, PIDTYPE_SID, p);
+	read_unlock(&tasklist_lock);
 
+	rcu_read_lock();
 	/* Now kill any processes that happen to have the tty open */
 	for_each_process(p) {
 		if (p->signal->tty == tty) {
@@ -2752,9 +2754,9 @@ void __do_SAK(struct tty_struct *tty)
 
 		continue;
 kill:
-		force_sig(SIGKILL, p);
+		send_sig(SIGKILL, p, 1);
 	}
-	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 #endif
 }
 
I replaced force_sig() as it does not check for task's sighand.

Kirill

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-15  9:32           ` Kirill Tkhai
@ 2018-01-15 20:51             ` Oleg Nesterov
  2018-01-16 11:33               ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2018-01-15 20:51 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 01/15, Kirill Tkhai wrote:
>
> On 12.01.2018 19:42, Oleg Nesterov wrote:
>
> > IOW, I do not understand why we can't simply use rcu_read_lock() after
> > do_each_pid_task/while_each_pid_task. Yes we can miss the new process/thread,
> > but if the creator process had this tty opened it should be killed by us so
> > fork/clone can't succeed: both do_fork() and send_sig() take the same lock
> > and do_fork() checks signal_pending() under ->siglock.
> >
> > No?
>
> Yes, but we send signal not every time. So, this was the only reason I added
> lock/unlock the locks. But anyway, __do_SAK() is racy and the effect of that
> is minimal, so it seems we may skip this.

Yes. If we don't send SIGKILL we do not care about the new child process/thread
we can miss, it can't have this tty opened at fork() time. If the child opens
this tty after that, __do_SAK can "miss" it anyway in that it can see it before
it does open(tty).

> I tested your patch with small modification in "struct files_struct *files;" ('*' is added).
> Could I send it with your "Signed-off-by" as the second version?

Yes, please feel free,

>  kill:
> -		force_sig(SIGKILL, p);
> +		send_sig(SIGKILL, p, 1);

Agreed, I didn't actually want to use force_sig(SIGKILL), copy-and-paste error.

But. on the second thought this probably needs another change... I don't understand
these force_sig/send_sig in __do_SAK().

If signal->tty == tty it does send_sig(SIGKILL), this won't kill the global or
sub-namespace init.

However, if iterate_fd() finds this tty it does force_sig(SIGKILL) which clears
SIGNAL_UNKILLABLE, so it can kill even the global init.

This looks strange, and probably unintentional. So it seems yoou should start
with "revert 20ac94378 [PATCH] do_SAK: Don't recursively take the tasklist_lock" ?
The original reason for that commit has gone a long ago.

At the same time, I do not know if we actually want to kill sub-namespace inits
or not. If yes, we can use SEND_SIG_FORCED (better than ugly force_sig()) but
skip the global init. But this will need yet another change.

Oleg.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-15 20:51             ` Oleg Nesterov
@ 2018-01-16 11:33               ` Kirill Tkhai
  2018-01-16 21:13                 ` Oleg Nesterov
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-16 11:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 15.01.2018 23:51, Oleg Nesterov wrote:
> On 01/15, Kirill Tkhai wrote:
>>
>> On 12.01.2018 19:42, Oleg Nesterov wrote:
>>
>>> IOW, I do not understand why we can't simply use rcu_read_lock() after
>>> do_each_pid_task/while_each_pid_task. Yes we can miss the new process/thread,
>>> but if the creator process had this tty opened it should be killed by us so
>>> fork/clone can't succeed: both do_fork() and send_sig() take the same lock
>>> and do_fork() checks signal_pending() under ->siglock.
>>>
>>> No?
>>
>> Yes, but we send signal not every time. So, this was the only reason I added
>> lock/unlock the locks. But anyway, __do_SAK() is racy and the effect of that
>> is minimal, so it seems we may skip this.
> 
> Yes. If we don't send SIGKILL we do not care about the new child process/thread
> we can miss, it can't have this tty opened at fork() time. If the child opens
> this tty after that, __do_SAK can "miss" it anyway in that it can see it before
> it does open(tty).
> 
>> I tested your patch with small modification in "struct files_struct *files;" ('*' is added).
>> Could I send it with your "Signed-off-by" as the second version?
> 
> Yes, please feel free,
> 
>>  kill:
>> -		force_sig(SIGKILL, p);
>> +		send_sig(SIGKILL, p, 1);
> 
> Agreed, I didn't actually want to use force_sig(SIGKILL), copy-and-paste error.

force_sig() is still safe under tasklist_lock as release_task() unhashes a task
from the lists and destroys sighand at the same time under it. So, it seems
there is no a problem :)

Anyway, we could use send_sig_info(SIGKILL, SEND_SIG_FORCED, p) instead of that
in the patch like you suggested below.

> But. on the second thought this probably needs another change... I don't understand
> these force_sig/send_sig in __do_SAK().
> 
> If signal->tty == tty it does send_sig(SIGKILL), this won't kill the global or
> sub-namespace init.
> 
> However, if iterate_fd() finds this tty it does force_sig(SIGKILL) which clears
> SIGNAL_UNKILLABLE, so it can kill even the global init.

Also we skip global init on session iteration. This could be useful for debugging,
when init is "/bin/bash" and some task started on top of bash is hunged.

> This looks strange, and probably unintentional. So it seems yoou should start
> with "revert 20ac94378 [PATCH] do_SAK: Don't recursively take the tasklist_lock" ?
> The original reason for that commit has gone a long ago.

If we revert it, lock_task_sighand() will be nested with task_lock(). I'm not
sure either there is an example of such nesting in kernel. Yeah, it's not for
a long time, next commit will change that. But anyway. Maybe, we add global
init check there and send_sig_info(SIGKILL, SEND_SIG_FORCED, p)? Please, see
patches on the bottom of this message.

> At the same time, I do not know if we actually want to kill sub-namespace inits
> or not. If yes, we can use SEND_SIG_FORCED (better than ugly force_sig()) but
> skip the global init. But this will need yet another change.

>From https://www.kernel.org/doc/Documentation/SAK.txt:

"An operating system's Secure Attention Key is a security tool which is
 provided as protection against trojan password capturing programs.  It
 is an undefeatable way of killing all programs which could be
 masquerading as login applications"

It seems, since not privileged user is able to create pid_ns to start
a "trojan password capturing program", we have to kill sub-namespace inits too.

Maybe we'll use something like this? (With a hope, two patches in one message is readable)

===============================[PATCH 1]====================================================
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index dc60aeea87d8..0df71480762a 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2719,27 +2719,34 @@ void __do_SAK(struct tty_struct *tty)
 	read_lock(&tasklist_lock);
 	/* Kill the entire session */
 	do_each_pid_task(session, PIDTYPE_SID, p) {
+		if (unlikely(is_global_init(p)))
+			continue;
 		tty_notice(tty, "SAK: killed process %d (%s): by session\n",
 			   task_pid_nr(p), p->comm);
-		send_sig(SIGKILL, p, 1);
+		send_sig_info(SIGKILL, SEND_SIG_FORCED, p);
 	} while_each_pid_task(session, PIDTYPE_SID, p);
 
 	/* Now kill any processes that happen to have the tty open */
 	do_each_thread(g, p) {
+		if (unlikely(is_global_init(p)))
+			continue;
+
 		if (p->signal->tty == tty) {
 			tty_notice(tty, "SAK: killed process %d (%s): by controlling tty\n",
 				   task_pid_nr(p), p->comm);
-			send_sig(SIGKILL, p, 1);
-			continue;
+			goto kill;
 		}
 		task_lock(p);
 		i = iterate_fd(p->files, 0, this_tty, tty);
+		task_unlock(p);
 		if (i != 0) {
 			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
 				   task_pid_nr(p), p->comm, i - 1);
-			force_sig(SIGKILL, p);
+			goto kill;
 		}
-		task_unlock(p);
+		continue;
+kill:
+		send_sig_info(SIGKILL, SEND_SIG_FORCED, p);
 	} while_each_thread(g, p);
 	read_unlock(&tasklist_lock);
 #endif

===================================================[PATCH 2]=======================================
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 0df71480762a..791f1d965d04 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2704,7 +2704,8 @@ void __do_SAK(struct tty_struct *tty)
 #ifdef TTY_SOFT_SAK
 	tty_hangup(tty);
 #else
-	struct task_struct *g, *p;
+	struct task_struct *p, *t;
+	struct files_struct *files;
 	struct pid *session;
 	int		i;
 
@@ -2727,7 +2728,7 @@ void __do_SAK(struct tty_struct *tty)
 	} while_each_pid_task(session, PIDTYPE_SID, p);
 
 	/* Now kill any processes that happen to have the tty open */
-	do_each_thread(g, p) {
+	for_each_process(p) {
 		if (unlikely(is_global_init(p)))
 			continue;
 
@@ -2736,18 +2737,28 @@ void __do_SAK(struct tty_struct *tty)
 				   task_pid_nr(p), p->comm);
 			goto kill;
 		}
-		task_lock(p);
-		i = iterate_fd(p->files, 0, this_tty, tty);
-		task_unlock(p);
-		if (i != 0) {
-			tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
-				   task_pid_nr(p), p->comm, i - 1);
-			goto kill;
+
+		files = NULL;
+		for_each_thread(p, t) {
+			if (t->files == files) /* racy but we do not care */
+				continue;
+
+			task_lock(t);
+			files = t->files;
+			i = iterate_fd(files, 0, this_tty, tty);
+			task_unlock(t);
+
+			if (i != 0) {
+				tty_notice(tty, "SAK: killed process %d (%s): by fd#%d\n",
+					   task_pid_nr(t), t->comm, i - 1);
+				goto kill;
+			}
 		}
+
 		continue;
 kill:
 		send_sig_info(SIGKILL, SEND_SIG_FORCED, p);
-	} while_each_thread(g, p);
+	}
 	read_unlock(&tasklist_lock);
 #endif
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-16 11:33               ` Kirill Tkhai
@ 2018-01-16 21:13                 ` Oleg Nesterov
  2018-01-17 12:47                   ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2018-01-16 21:13 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 01/16, Kirill Tkhai wrote:
>
> On 15.01.2018 23:51, Oleg Nesterov wrote:
> >
> >>  kill:
> >> -		force_sig(SIGKILL, p);
> >> +		send_sig(SIGKILL, p, 1);
> >
> > Agreed, I didn't actually want to use force_sig(SIGKILL), copy-and-paste error.
>
> force_sig() is still safe under tasklist_lock as release_task() unhashes a task
> from the lists and destroys sighand at the same time under it. So, it seems
> there is no a problem :)

I didn't mean it is unsafe. The problem is that force_sig() replaced send_sig()
to avoid tasklist_lock which we no longer take in send-signal paths. Another
problem is that it differs from send_sig(SIGKILL) used in other places and this
difference (ability to kill SIGNAL_UNKILLABLE tasks) was added by accident, that
was my point.

> Anyway, we could use send_sig_info(SIGKILL, SEND_SIG_FORCED, p) instead of that
> in the patch like you suggested below.

Probably, but this needs another/separate change.

> Also we skip global init on session iteration. This could be useful for debugging,
> when init is "/bin/bash" and some task started on top of bash is hunged.

We will need this only after we use SEND_SIG_FORCED, send_sig(SIGKILL) won't kill
init.

> > This looks strange, and probably unintentional. So it seems yoou should start
> > with "revert 20ac94378 [PATCH] do_SAK: Don't recursively take the tasklist_lock" ?
> > The original reason for that commit has gone a long ago.
>
> If we revert it, lock_task_sighand() will be nested with task_lock().

This is safe. lock_task_sighand() is irq-safe (just like ->siglock) and it
is actually used in irqs. Thus it is safe to use it under task_lock() which
doesn't disable irqs.

And,

> Yeah, it's not for
> a long time, next commit will change that.

Yes, there is no reason to send SIGKILL under task_lock().

> > At the same time, I do not know if we actually want to kill sub-namespace inits
> > or not. If yes, we can use SEND_SIG_FORCED (better than ugly force_sig()) but
> > skip the global init. But this will need yet another change.
>
> From https://www.kernel.org/doc/Documentation/SAK.txt:
>
> "An operating system's Secure Attention Key is a security tool which is
>  provided as protection against trojan password capturing programs.  It
>  is an undefeatable way of killing all programs which could be
>  masquerading as login applications"
>
> It seems, since not privileged user is able to create pid_ns to start
> a "trojan password capturing program", we have to kill sub-namespace inits too.

Agreed, that is why I suggested SEND_SIG_FORCED.

However. this is the user-visible change and who knows, perhaps it is too late
to change the current behaviour. So I think we should do this after cleanups,
this way we can easily revert it later in (unlikely) case someone complains.

But, Kirill, this is up to you, I won't insist.

Oleg.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK()
  2018-01-16 21:13                 ` Oleg Nesterov
@ 2018-01-17 12:47                   ` Kirill Tkhai
  0 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2018-01-17 12:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, gregkh, jslaby, viro, keescook, serge,
	james.l.morris, luto, john.johansen, mingo, akpm, mhocko, peterz

On 17.01.2018 00:13, Oleg Nesterov wrote:
> On 01/16, Kirill Tkhai wrote:
>>
>> On 15.01.2018 23:51, Oleg Nesterov wrote:
>>>
>>>>  kill:
>>>> -		force_sig(SIGKILL, p);
>>>> +		send_sig(SIGKILL, p, 1);
>>>
>>> Agreed, I didn't actually want to use force_sig(SIGKILL), copy-and-paste error.
>>
>> force_sig() is still safe under tasklist_lock as release_task() unhashes a task
>> from the lists and destroys sighand at the same time under it. So, it seems
>> there is no a problem :)
> 
> I didn't mean it is unsafe. The problem is that force_sig() replaced send_sig()
> to avoid tasklist_lock which we no longer take in send-signal paths. Another
> problem is that it differs from send_sig(SIGKILL) used in other places and this
> difference (ability to kill SIGNAL_UNKILLABLE tasks) was added by accident, that
> was my point.
> 
>> Anyway, we could use send_sig_info(SIGKILL, SEND_SIG_FORCED, p) instead of that
>> in the patch like you suggested below.
> 
> Probably, but this needs another/separate change.

Ok, as there is no single right answer on this problem, let's skip it for this patchset.

>> Also we skip global init on session iteration. This could be useful for debugging,
>> when init is "/bin/bash" and some task started on top of bash is hunged.
> 
> We will need this only after we use SEND_SIG_FORCED, send_sig(SIGKILL) won't kill
> init.
> 
>>> This looks strange, and probably unintentional. So it seems yoou should start
>>> with "revert 20ac94378 [PATCH] do_SAK: Don't recursively take the tasklist_lock" ?
>>> The original reason for that commit has gone a long ago.
>>
>> If we revert it, lock_task_sighand() will be nested with task_lock().
> 
> This is safe. lock_task_sighand() is irq-safe (just like ->siglock) and it
> is actually used in irqs. Thus it is safe to use it under task_lock() which
> doesn't disable irqs.
> 
> And,
> 
>> Yeah, it's not for
>> a long time, next commit will change that.
> 
> Yes, there is no reason to send SIGKILL under task_lock().
> 
>>> At the same time, I do not know if we actually want to kill sub-namespace inits
>>> or not. If yes, we can use SEND_SIG_FORCED (better than ugly force_sig()) but
>>> skip the global init. But this will need yet another change.
>>
>> From https://www.kernel.org/doc/Documentation/SAK.txt:
>>
>> "An operating system's Secure Attention Key is a security tool which is
>>  provided as protection against trojan password capturing programs.  It
>>  is an undefeatable way of killing all programs which could be
>>  masquerading as login applications"
>>
>> It seems, since not privileged user is able to create pid_ns to start
>> a "trojan password capturing program", we have to kill sub-namespace inits too.
> 
> Agreed, that is why I suggested SEND_SIG_FORCED.
> 
> However. this is the user-visible change and who knows, perhaps it is too late
> to change the current behaviour. So I think we should do this after cleanups,
> this way we can easily revert it later in (unlikely) case someone complains.
> 
> But, Kirill, this is up to you, I won't insist.

Thanks, Oleg. I've sent v2 "[PATCH v2 0/3] tty: Make __do_SAK() less greedy in regard to tasklist_lock"
taking into account all of your comments.

(It seems this theme is not interesting for most people. I've removed them from CC in v2,
to reduce people mail traffic)

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-01-17 12:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-11 15:49 [PATCH 0/4] fs, tty: Make __do_SAK() less greedy in regard to tasklist_lock Kirill Tkhai
2018-01-11 15:50 ` [PATCH 1/4] exec: Pass unshared files_struct to load_binary() Kirill Tkhai
2018-01-11 15:50 ` [PATCH 2/4] exec: Assign unshared files after there is no way back Kirill Tkhai
2018-01-11 15:50 ` [PATCH 3/4] tty: Iterate only thread group leaders in __do_SAK() Kirill Tkhai
2018-01-11 18:34   ` Oleg Nesterov
2018-01-12  8:42     ` Kirill Tkhai
2018-01-12 10:05       ` Kirill Tkhai
2018-01-12 16:42         ` Oleg Nesterov
2018-01-15  9:32           ` Kirill Tkhai
2018-01-15 20:51             ` Oleg Nesterov
2018-01-16 11:33               ` Kirill Tkhai
2018-01-16 21:13                 ` Oleg Nesterov
2018-01-17 12:47                   ` Kirill Tkhai
2018-01-11 15:50 ` [PATCH 4/4] tty: Use RCU read lock to iterate tasks " Kirill Tkhai

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.