All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-09-30 17:58 ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Brian Gerst, Borislav Petkov, Jann Horn, Linux API,
	Linus Torvalds, Kees Cook, Tycho Andersen, Andy Lutomirski

Jann Horn noticed that KSTK_ESP + eager task stack freeing was a bad
combination and could crash.  I could very easily fix it to not
crash, but I think that using KSTK_ESP on a remote task is
questionable in general.  Therefore, I propose to get rid of the
major users for 4.9.

This series makes two ABI changes:

 - /proc/PID/stat will show 0 0 instead of esp eip.  I don't think
   that the esp and eip fields were ever reliable unless the target
   task was being ptraced by the reading task, and ptrace(2) gives a
   far better interface to the same thing in this case.  On the flip
   side, these fields could leak kernel addresses under some
   circumstances on some arches if the target task is running or was
   interrupted (on a remote CPU or preempted on the local CPU) in an
   inconvenient place.

   I suspect it made sense when everything was single-CPU and non-
   preemptible, which implied that the target task *had* to be
   sleeping in something resembling normal kernel code, but that
   hasn't been the case for a long time.

 - /proc/PID/task/TID/maps did some interesting things to guess which
   vma was the stack.  This behavior is recent and IMO dangerously
   racy.  I'd like to get rid of it.

This is a little late so, if there is significant objection, I'll
just do the easy fix for 4.9 and resubmit for 4.10.

Andy Lutomirski (3):
  proc: Stop reporting eip and esp in /proc/PID/stat
  proc: Stop trying to report thread stacks
  mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()

 Documentation/filesystems/proc.txt | 26 --------------------------
 fs/proc/array.c                    |  9 +++++----
 fs/proc/task_mmu.c                 | 29 ++++++++++-------------------
 fs/proc/task_nommu.c               | 28 ++++++++++------------------
 include/linux/mm.h                 |  2 +-
 mm/util.c                          |  4 +++-
 security/selinux/hooks.c           |  2 +-
 7 files changed, 30 insertions(+), 70 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-09-30 17:58 ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen, Andy Lutomirski

Jann Horn noticed that KSTK_ESP + eager task stack freeing was a bad
combination and could crash.  I could very easily fix it to not
crash, but I think that using KSTK_ESP on a remote task is
questionable in general.  Therefore, I propose to get rid of the
major users for 4.9.

This series makes two ABI changes:

 - /proc/PID/stat will show 0 0 instead of esp eip.  I don't think
   that the esp and eip fields were ever reliable unless the target
   task was being ptraced by the reading task, and ptrace(2) gives a
   far better interface to the same thing in this case.  On the flip
   side, these fields could leak kernel addresses under some
   circumstances on some arches if the target task is running or was
   interrupted (on a remote CPU or preempted on the local CPU) in an
   inconvenient place.

   I suspect it made sense when everything was single-CPU and non-
   preemptible, which implied that the target task *had* to be
   sleeping in something resembling normal kernel code, but that
   hasn't been the case for a long time.

 - /proc/PID/task/TID/maps did some interesting things to guess which
   vma was the stack.  This behavior is recent and IMO dangerously
   racy.  I'd like to get rid of it.

This is a little late so, if there is significant objection, I'll
just do the easy fix for 4.9 and resubmit for 4.10.

Andy Lutomirski (3):
  proc: Stop reporting eip and esp in /proc/PID/stat
  proc: Stop trying to report thread stacks
  mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()

 Documentation/filesystems/proc.txt | 26 --------------------------
 fs/proc/array.c                    |  9 +++++----
 fs/proc/task_mmu.c                 | 29 ++++++++++-------------------
 fs/proc/task_nommu.c               | 28 ++++++++++------------------
 include/linux/mm.h                 |  2 +-
 mm/util.c                          |  4 +++-
 security/selinux/hooks.c           |  2 +-
 7 files changed, 30 insertions(+), 70 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-09-30 17:58   ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Brian Gerst, Borislav Petkov, Jann Horn, Linux API,
	Linus Torvalds, Kees Cook, Tycho Andersen, Andy Lutomirski,
	Tetsuo Handa

Reporting these fields on a non-current task is dangerous.  If the
task is in any state other than normal kernel code, they may contain
garbage or even kernel addresses on some architectures.  (x86_64
used to do this.  I bet lots of architectures still do.)  With
CONFIG_THREAD_INFO_IN_TASK, it can OOPS, too.

As far as I know, there are no use programs that make any material
use of these fields, so just get rid of them.

Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 fs/proc/array.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c7de12197b..1bb1097e73b7 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -417,10 +417,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	mm = get_task_mm(task);
 	if (mm) {
 		vsize = task_vsize(mm);
-		if (permitted) {
-			eip = KSTK_EIP(task);
-			esp = KSTK_ESP(task);
-		}
+		/*
+		 * esp and eip are intentionally zeroed out.  There is no
+		 * non-racy way to read them without freezing the task.
+		 * Programs that need reliable values can use ptrace(2).
+		 */
 	}
 
 	get_task_comm(tcomm, task);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-09-30 17:58   ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen, Andy Lutomirski, Tetsuo Handa

Reporting these fields on a non-current task is dangerous.  If the
task is in any state other than normal kernel code, they may contain
garbage or even kernel addresses on some architectures.  (x86_64
used to do this.  I bet lots of architectures still do.)  With
CONFIG_THREAD_INFO_IN_TASK, it can OOPS, too.

As far as I know, there are no use programs that make any material
use of these fields, so just get rid of them.

Cc: Tetsuo Handa <penguin-kernel-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org>
Cc: Tycho Andersen <tycho.andersen-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Reported-by: Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org>
Signed-off-by: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 fs/proc/array.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c7de12197b..1bb1097e73b7 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -417,10 +417,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	mm = get_task_mm(task);
 	if (mm) {
 		vsize = task_vsize(mm);
-		if (permitted) {
-			eip = KSTK_EIP(task);
-			esp = KSTK_ESP(task);
-		}
+		/*
+		 * esp and eip are intentionally zeroed out.  There is no
+		 * non-racy way to read them without freezing the task.
+		 * Programs that need reliable values can use ptrace(2).
+		 */
 	}
 
 	get_task_comm(tcomm, task);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/3] proc: Stop trying to report thread stacks
  2016-09-30 17:58 ` Andy Lutomirski
  (?)
  (?)
@ 2016-09-30 17:58 ` Andy Lutomirski
  2016-10-20 11:13   ` [tip:mm/urgent] fs/proc: " tip-bot for Andy Lutomirski
  -1 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Brian Gerst, Borislav Petkov, Jann Horn, Linux API,
	Linus Torvalds, Kees Cook, Tycho Andersen, Andy Lutomirski,
	Johannes Weiner

This reverts more of b76437579d13 ("procfs: mark thread stack
correctly in proc/<pid>/maps"), which was partially reverted by
65376df58217 ("proc: revert /proc/<pid>/maps [stack:TID]
annotation").

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.
In contract, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD, crash-prone --
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn <jann@thejh.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 Documentation/filesystems/proc.txt | 26 --------------------------
 fs/proc/task_mmu.c                 | 29 ++++++++++-------------------
 fs/proc/task_nommu.c               | 28 ++++++++++------------------
 3 files changed, 20 insertions(+), 63 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 68080ad6a75e..0457bf57d2eb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -395,32 +395,6 @@ is not associated with a file:
 
  or if empty, the mapping is anonymous.
 
-The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint
-of the individual tasks of a process. In this file you will see a mapping marked
-as [stack] if that task sees it as a stack. Hence, for the example above, the
-task-level map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this:
-
-08048000-08049000 r-xp 00000000 03:00 8312       /opt/test
-08049000-0804a000 rw-p 00001000 03:00 8312       /opt/test
-0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
-a7cb1000-a7cb2000 ---p 00000000 00:00 0
-a7cb2000-a7eb2000 rw-p 00000000 00:00 0
-a7eb2000-a7eb3000 ---p 00000000 00:00 0
-a7eb3000-a7ed5000 rw-p 00000000 00:00 0          [stack]
-a7ed5000-a8008000 r-xp 00000000 03:00 4222       /lib/libc.so.6
-a8008000-a800a000 r--p 00133000 03:00 4222       /lib/libc.so.6
-a800a000-a800b000 rw-p 00135000 03:00 4222       /lib/libc.so.6
-a800b000-a800e000 rw-p 00000000 00:00 0
-a800e000-a8022000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
-a8022000-a8023000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
-a8023000-a8024000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
-a8024000-a8027000 rw-p 00000000 00:00 0
-a8027000-a8043000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
-a8043000-a8044000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
-a8044000-a8045000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
-aff35000-aff4a000 rw-p 00000000 00:00 0
-ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
-
 The /proc/PID/smaps is an extension based on maps, showing the memory
 consumption for each of the process's mappings. For each of mappings there
 is a series of lines such as the following:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..4e3a9510c9cc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -264,24 +264,15 @@ static int do_maps_open(struct inode *inode, struct file *file,
  * /proc/PID/maps that is the stack of the main task.
  */
 static int is_stack(struct proc_maps_private *priv,
-		    struct vm_area_struct *vma, int is_pid)
+		    struct vm_area_struct *vma)
 {
-	int stack = 0;
-
-	if (is_pid) {
-		stack = vma->vm_start <= vma->vm_mm->start_stack &&
-			vma->vm_end >= vma->vm_mm->start_stack;
-	} else {
-		struct inode *inode = priv->inode;
-		struct task_struct *task;
-
-		rcu_read_lock();
-		task = pid_task(proc_pid(inode), PIDTYPE_PID);
-		if (task)
-			stack = vma_is_stack_for_task(vma, task);
-		rcu_read_unlock();
-	}
-	return stack;
+	/*
+	 * We make no effort to guess what a given thread considers to be
+	 * its "stack".  It's not even well-defined for programs written
+	 * languages like Go.
+	 */
+	return vma->vm_start <= vma->vm_mm->start_stack &&
+		vma->vm_end >= vma->vm_mm->start_stack;
 }
 
 static void
@@ -352,7 +343,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 			goto done;
 		}
 
-		if (is_stack(priv, vma, is_pid))
+		if (is_stack(priv, vma))
 			name = "[stack]";
 	}
 
@@ -1667,7 +1658,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 		seq_file_path(m, file, "\n\t= ");
 	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
 		seq_puts(m, " heap");
-	} else if (is_stack(proc_priv, vma, is_pid)) {
+	} else if (is_stack(proc_priv, vma)) {
 		seq_puts(m, " stack");
 	}
 
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index faacb0c0d857..37175621e890 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -124,25 +124,17 @@ unsigned long task_statm(struct mm_struct *mm,
 }
 
 static int is_stack(struct proc_maps_private *priv,
-		    struct vm_area_struct *vma, int is_pid)
+		    struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int stack = 0;
-
-	if (is_pid) {
-		stack = vma->vm_start <= mm->start_stack &&
-			vma->vm_end >= mm->start_stack;
-	} else {
-		struct inode *inode = priv->inode;
-		struct task_struct *task;
-
-		rcu_read_lock();
-		task = pid_task(proc_pid(inode), PIDTYPE_PID);
-		if (task)
-			stack = vma_is_stack_for_task(vma, task);
-		rcu_read_unlock();
-	}
-	return stack;
+
+	/*
+	 * We make no effort to guess what a given thread considers to be
+	 * its "stack".  It's not even well-defined for programs written
+	 * languages like Go.
+	 */
+	return vma->vm_start <= mm->start_stack &&
+		vma->vm_end >= mm->start_stack;
 }
 
 /*
@@ -184,7 +176,7 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma,
 	if (file) {
 		seq_pad(m, ' ');
 		seq_file_path(m, file, "");
-	} else if (mm && is_stack(priv, vma, is_pid)) {
+	} else if (mm && is_stack(priv, vma)) {
 		seq_pad(m, ' ');
 		seq_printf(m, "[stack]");
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/3] mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
@ 2016-09-30 17:58   ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Brian Gerst, Borislav Petkov, Jann Horn, Linux API,
	Linus Torvalds, Kees Cook, Tycho Andersen, Andy Lutomirski

Asking for a non-current task's stack can't be done without races
unless the task is frozen in kernel mode.  As far as I know,
vm_is_stack_for_task() never had a safe non-current use case.

The __unused annotation is because some KSTK_ESP implementations
ignore their parameter, which IMO is further justification for this
patch.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/mm.h       | 2 +-
 mm/util.c                | 4 +++-
 security/selinux/hooks.c | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..c365cf49f54a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1403,7 +1403,7 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma,
 		!vma_growsup(vma->vm_next, addr);
 }
 
-int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t);
+int vma_is_stack_for_current(struct vm_area_struct *vma);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/util.c b/mm/util.c
index 662cddf914af..c174e8921995 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -230,8 +230,10 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /* Check if the vma is being used as a stack by this task */
-int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t)
+int vma_is_stack_for_current(struct vm_area_struct *vma)
 {
+	struct task_struct * __maybe_unused t = current;
+
 	return (vma->vm_start <= KSTK_ESP(t) && vma->vm_end >= KSTK_ESP(t));
 }
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 13185a6c266a..3a40135bde5e 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3505,7 +3505,7 @@ static int selinux_file_mprotect(struct vm_area_struct *vma,
 		} else if (!vma->vm_file &&
 			   ((vma->vm_start <= vma->vm_mm->start_stack &&
 			     vma->vm_end >= vma->vm_mm->start_stack) ||
-			    vma_is_stack_for_task(vma, current))) {
+			    vma_is_stack_for_current(vma))) {
 			rc = current_has_perm(current, PROCESS__EXECSTACK);
 		} else if (vma->vm_file && vma->anon_vma) {
 			/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/3] mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
@ 2016-09-30 17:58   ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-09-30 17:58 UTC (permalink / raw)
  To: x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen, Andy Lutomirski

Asking for a non-current task's stack can't be done without races
unless the task is frozen in kernel mode.  As far as I know,
vm_is_stack_for_task() never had a safe non-current use case.

The __unused annotation is because some KSTK_ESP implementations
ignore their parameter, which IMO is further justification for this
patch.

Signed-off-by: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/mm.h       | 2 +-
 mm/util.c                | 4 +++-
 security/selinux/hooks.c | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..c365cf49f54a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1403,7 +1403,7 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma,
 		!vma_growsup(vma->vm_next, addr);
 }
 
-int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t);
+int vma_is_stack_for_current(struct vm_area_struct *vma);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/util.c b/mm/util.c
index 662cddf914af..c174e8921995 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -230,8 +230,10 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /* Check if the vma is being used as a stack by this task */
-int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t)
+int vma_is_stack_for_current(struct vm_area_struct *vma)
 {
+	struct task_struct * __maybe_unused t = current;
+
 	return (vma->vm_start <= KSTK_ESP(t) && vma->vm_end >= KSTK_ESP(t));
 }
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 13185a6c266a..3a40135bde5e 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3505,7 +3505,7 @@ static int selinux_file_mprotect(struct vm_area_struct *vma,
 		} else if (!vma->vm_file &&
 			   ((vma->vm_start <= vma->vm_mm->start_stack &&
 			     vma->vm_end >= vma->vm_mm->start_stack) ||
-			    vma_is_stack_for_task(vma, current))) {
+			    vma_is_stack_for_current(vma))) {
 			rc = current_has_perm(current, PROCESS__EXECSTACK);
 		} else if (vma->vm_file && vma->anon_vma) {
 			/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-09-30 18:56     ` Jann Horn
  0 siblings, 0 replies; 38+ messages in thread
From: Jann Horn @ 2016-09-30 18:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Brian Gerst, Borislav Petkov, Linux API,
	Linus Torvalds, Kees Cook, Tycho Andersen, Tetsuo Handa

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

On Fri, Sep 30, 2016 at 10:58:56AM -0700, Andy Lutomirski wrote:
> Reporting these fields on a non-current task is dangerous.  If the
> task is in any state other than normal kernel code, they may contain
> garbage or even kernel addresses on some architectures.

Stupid question: Doesn't something similar apply to wchan?
Am I missing something?

It looks to me as if the get_wchan() implementation of X86 has the same
issue, with the difference that it's not obviously usable as an infoleak
because only known symbol names are printed, not numeric values.

get_wchan() basically does the following:

 - make sure the remote thread is sleeping (but don't take any locks to
   ensure it stays that way)
 - read the remote thread's saved kernel stack pointer
 - checks that the saved kernel stack pointer points to the main part of
   the remote stack
 - read a frame pointer through the saved kernel stack pointer; the read
   value could be anything if a race occured
 - check that the frame pointer points into the main part of the remote
   stack
 - read a saved instruction pointer through the frame pointer; the read
   value could be anything if a race occured
 - ensure that the "saved instruction pointer" doesn't point to the
   sections .sched.text and .spinlock.text
 - return the "saved instruction pointer"

So as far as I can tell, it *might* be possible for userspace to leak
information about the kernel text by alternatingly storing a proper
saved instruction pointer and a user-supplied value in a place from
which the kernel tries to read a saved instruction pointer.
By supplying values that are suspected to point to kernel text and
looking at the reported wchan values, an attacker could learn information
that allows him to e.g. defeat kASLR and even slowly locate functions in
custom kernels.

I vaguely remember seeing some similar code that defends against issues
like this by reading some counter beforehand and afterwards and checking
that it stays the same, but I'm not sure where that was. I think I saw it
while looking at the series for vmalloc()ed stacks.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-09-30 18:56     ` Jann Horn
  0 siblings, 0 replies; 38+ messages in thread
From: Jann Horn @ 2016-09-30 18:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Brian Gerst, Borislav Petkov, Linux API, Linus Torvalds,
	Kees Cook, Tycho Andersen, Tetsuo Handa

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

On Fri, Sep 30, 2016 at 10:58:56AM -0700, Andy Lutomirski wrote:
> Reporting these fields on a non-current task is dangerous.  If the
> task is in any state other than normal kernel code, they may contain
> garbage or even kernel addresses on some architectures.

Stupid question: Doesn't something similar apply to wchan?
Am I missing something?

It looks to me as if the get_wchan() implementation of X86 has the same
issue, with the difference that it's not obviously usable as an infoleak
because only known symbol names are printed, not numeric values.

get_wchan() basically does the following:

 - make sure the remote thread is sleeping (but don't take any locks to
   ensure it stays that way)
 - read the remote thread's saved kernel stack pointer
 - checks that the saved kernel stack pointer points to the main part of
   the remote stack
 - read a frame pointer through the saved kernel stack pointer; the read
   value could be anything if a race occured
 - check that the frame pointer points into the main part of the remote
   stack
 - read a saved instruction pointer through the frame pointer; the read
   value could be anything if a race occured
 - ensure that the "saved instruction pointer" doesn't point to the
   sections .sched.text and .spinlock.text
 - return the "saved instruction pointer"

So as far as I can tell, it *might* be possible for userspace to leak
information about the kernel text by alternatingly storing a proper
saved instruction pointer and a user-supplied value in a place from
which the kernel tries to read a saved instruction pointer.
By supplying values that are suspected to point to kernel text and
looking at the reported wchan values, an attacker could learn information
that allows him to e.g. defeat kASLR and even slowly locate functions in
custom kernels.

I vaguely remember seeing some similar code that defends against issues
like this by reading some counter beforehand and afterwards and checking
that it stays the same, but I'm not sure where that was. I think I saw it
while looking at the series for vmalloc()ed stacks.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-01  2:01       ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-01  2:01 UTC (permalink / raw)
  To: Jann Horn, Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Brian Gerst,
	Borislav Petkov, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen, Tetsuo Handa

[cc: PeterZ]

On Fri, Sep 30, 2016 at 11:56 AM, Jann Horn <jann@thejh.net> wrote:
> On Fri, Sep 30, 2016 at 10:58:56AM -0700, Andy Lutomirski wrote:
>> Reporting these fields on a non-current task is dangerous.  If the
>> task is in any state other than normal kernel code, they may contain
>> garbage or even kernel addresses on some architectures.
>
> Stupid question: Doesn't something similar apply to wchan?
> Am I missing something?
>
> It looks to me as if the get_wchan() implementation of X86 has the same
> issue, with the difference that it's not obviously usable as an infoleak
> because only known symbol names are printed, not numeric values.
>
> get_wchan() basically does the following:
>
>  - make sure the remote thread is sleeping (but don't take any locks to
>    ensure it stays that way)

Peter, how nasty would it be to add some lightish-weight lock that
lets us pin a task in a non-running state?  Maybe we could take the rq
lock, do something to the task to make it sleepy (steal it off the
queue?), unlock the lock, do whatever we're going, then take the lock
again and put it back.

Or if we had a seqlock-like thing, we could maybe arrange for
get_wchan to abort if the task get scheduled between when it starts
and when it finishes.

On an unrelated note, can we please lock down all the silly historical
*userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
the very least, only argv[0] should be visible if the reader lacks
ptrace access).

Less nasty ones include: limits, sched, autogroup, comm, wchan,
schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter

uid_map, gid_map, etc are just screwed up.  They should be per
*namespace* somewhere, and they should require creds on the namespace.

timerslack is totally fscked up -- it allows ugo to write and it
checks the wrong creds.  Jann, does your series fix that?

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-01  2:01       ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-01  2:01 UTC (permalink / raw)
  To: Jann Horn, Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Brian Gerst, Borislav Petkov, Linux API, Linus Torvalds,
	Kees Cook, Tycho Andersen, Tetsuo Handa

[cc: PeterZ]

On Fri, Sep 30, 2016 at 11:56 AM, Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org> wrote:
> On Fri, Sep 30, 2016 at 10:58:56AM -0700, Andy Lutomirski wrote:
>> Reporting these fields on a non-current task is dangerous.  If the
>> task is in any state other than normal kernel code, they may contain
>> garbage or even kernel addresses on some architectures.
>
> Stupid question: Doesn't something similar apply to wchan?
> Am I missing something?
>
> It looks to me as if the get_wchan() implementation of X86 has the same
> issue, with the difference that it's not obviously usable as an infoleak
> because only known symbol names are printed, not numeric values.
>
> get_wchan() basically does the following:
>
>  - make sure the remote thread is sleeping (but don't take any locks to
>    ensure it stays that way)

Peter, how nasty would it be to add some lightish-weight lock that
lets us pin a task in a non-running state?  Maybe we could take the rq
lock, do something to the task to make it sleepy (steal it off the
queue?), unlock the lock, do whatever we're going, then take the lock
again and put it back.

Or if we had a seqlock-like thing, we could maybe arrange for
get_wchan to abort if the task get scheduled between when it starts
and when it finishes.

On an unrelated note, can we please lock down all the silly historical
*userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
the very least, only argv[0] should be visible if the reader lacks
ptrace access).

Less nasty ones include: limits, sched, autogroup, comm, wchan,
schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter

uid_map, gid_map, etc are just screwed up.  They should be per
*namespace* somewhere, and they should require creds on the namespace.

timerslack is totally fscked up -- it allows ugo to write and it
checks the wrong creds.  Jann, does your series fix that?

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-01  4:22         ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2016-10-01  4:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Peter Zijlstra, Andy Lutomirski, X86 ML, linux-kernel,
	Brian Gerst, Borislav Petkov, Linux API, Kees Cook,
	Tycho Andersen, Tetsuo Handa

On Fri, Sep 30, 2016 at 7:01 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Peter, how nasty would it be to add some lightish-weight lock that
> lets us pin a task in a non-running state?  Maybe we could take the rq
> lock, do something to the task to make it sleepy (steal it off the
> queue?), unlock the lock, do whatever we're going, then take the lock
> again and put it back.

No. Don't do this. Forcing some sleeping lock in the core task state
/proc stuff is a nightmare. That thing ends up being used very heavily
under some loads. No _way_ is it ok to synchronize with the target
task.

> Or if we had a seqlock-like thing, we could maybe arrange for
> get_wchan to abort if the task get scheduled between when it starts
> and when it finishes.

seq_lock might be ok, but do we even need it? What's the worst that
can happen? An odd symbol name showing up in a race condition? Sounds
like a non-issue to me.

        Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-01  4:22         ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2016-10-01  4:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Peter Zijlstra, Andy Lutomirski, X86 ML,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Linux API, Kees Cook, Tycho Andersen,
	Tetsuo Handa

On Fri, Sep 30, 2016 at 7:01 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>
> Peter, how nasty would it be to add some lightish-weight lock that
> lets us pin a task in a non-running state?  Maybe we could take the rq
> lock, do something to the task to make it sleepy (steal it off the
> queue?), unlock the lock, do whatever we're going, then take the lock
> again and put it back.

No. Don't do this. Forcing some sleeping lock in the core task state
/proc stuff is a nightmare. That thing ends up being used very heavily
under some loads. No _way_ is it ok to synchronize with the target
task.

> Or if we had a seqlock-like thing, we could maybe arrange for
> get_wchan to abort if the task get scheduled between when it starts
> and when it finishes.

seq_lock might be ok, but do we even need it? What's the worst that
can happen? An odd symbol name showing up in a race condition? Sounds
like a non-issue to me.

        Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-01 10:37         ` Jann Horn
  0 siblings, 0 replies; 38+ messages in thread
From: Jann Horn @ 2016-10-01 10:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, X86 ML, linux-kernel,
	Brian Gerst, Borislav Petkov, Linux API, Linus Torvalds,
	Kees Cook, Tycho Andersen, Tetsuo Handa

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]

On Fri, Sep 30, 2016 at 07:01:13PM -0700, Andy Lutomirski wrote:
> On an unrelated note, can we please lock down all the silly historical
> *userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
> the very least, only argv[0] should be visible if the reader lacks
> ptrace access).
> 
> Less nasty ones include: limits, sched, autogroup, comm, wchan,
> schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter

If that doesn't break stuff, I'm very much in favor of it.


> uid_map, gid_map, etc are just screwed up.  They should be per
> *namespace* somewhere, and they should require creds on the namespace.

What do you have in mind? Something like
/proc/namespaces/user:123456/{uid_map,gid_map,setgroups,parent_ns},
with jumped fake symlinks to the directory and its entries in /proc/$pid/?


> timerslack is totally fscked up -- it allows ugo to write and it
> checks the wrong creds.  Jann, does your series fix that?

Nope. Never noticed that thing so far, probably because it was only
added a few months ago. :/ Will add it to my series.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-01 10:37         ` Jann Horn
  0 siblings, 0 replies; 38+ messages in thread
From: Jann Horn @ 2016-10-01 10:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, X86 ML,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen, Tetsuo Handa

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]

On Fri, Sep 30, 2016 at 07:01:13PM -0700, Andy Lutomirski wrote:
> On an unrelated note, can we please lock down all the silly historical
> *userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
> the very least, only argv[0] should be visible if the reader lacks
> ptrace access).
> 
> Less nasty ones include: limits, sched, autogroup, comm, wchan,
> schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter

If that doesn't break stuff, I'm very much in favor of it.


> uid_map, gid_map, etc are just screwed up.  They should be per
> *namespace* somewhere, and they should require creds on the namespace.

What do you have in mind? Something like
/proc/namespaces/user:123456/{uid_map,gid_map,setgroups,parent_ns},
with jumped fake symlinks to the directory and its entries in /proc/$pid/?


> timerslack is totally fscked up -- it allows ugo to write and it
> checks the wrong creds.  Jann, does your series fix that?

Nope. Never noticed that thing so far, probably because it was only
added a few months ago. :/ Will add it to my series.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-03 23:08   ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-03 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Brian Gerst, Borislav Petkov, Jann Horn,
	Linux API, Linus Torvalds, Kees Cook, Tycho Andersen

On Fri, Sep 30, 2016 at 10:58 AM, Andy Lutomirski <luto@kernel.org> wrote:
> Jann Horn noticed that KSTK_ESP + eager task stack freeing was a bad
> combination and could crash.  I could very easily fix it to not
> crash, but I think that using KSTK_ESP on a remote task is
> questionable in general.  Therefore, I propose to get rid of the
> major users for 4.9.

Ping!

We need to decide fairly soon whether to apply these (or perhaps just
patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
applied, I'll send quick fixups to pin the stack in the offending
code.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-03 23:08   ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-03 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen

On Fri, Sep 30, 2016 at 10:58 AM, Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Jann Horn noticed that KSTK_ESP + eager task stack freeing was a bad
> combination and could crash.  I could very easily fix it to not
> crash, but I think that using KSTK_ESP on a remote task is
> questionable in general.  Therefore, I propose to get rid of the
> major users for 4.9.

Ping!

We need to decide fairly soon whether to apply these (or perhaps just
patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
applied, I'll send quick fixups to pin the stack in the offending
code.

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-03 23:17     ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2016-10-03 23:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Kees Cook, Tycho Andersen

On Mon, Oct 3, 2016 at 4:08 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Ping!
>
> We need to decide fairly soon whether to apply these (or perhaps just
> patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
> applied, I'll send quick fixups to pin the stack in the offending
> code.

I think we should apply it. Hopefully nothing uses it, and nobody will
notice. And if somebody *does* notice, the sooner we find out, the
better.

             Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-03 23:17     ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2016-10-03 23:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Brian Gerst, Borislav Petkov, Jann Horn, Linux API, Kees Cook,
	Tycho Andersen

On Mon, Oct 3, 2016 at 4:08 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>
> Ping!
>
> We need to decide fairly soon whether to apply these (or perhaps just
> patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
> applied, I'll send quick fixups to pin the stack in the offending
> code.

I think we should apply it. Hopefully nothing uses it, and nobody will
notice. And if somebody *does* notice, the sooner we find out, the
better.

             Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-04  7:06       ` Raymond Jennings
  0 siblings, 0 replies; 38+ messages in thread
From: Raymond Jennings @ 2016-10-04  7:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, linux-kernel,
	Brian Gerst, Borislav Petkov, Jann Horn, Linux API, Kees Cook,
	Tycho Andersen

My personal opinion is that even looking at esp/rsp is asking for 
trouble.  The only reliable information is VM_STACK or another VM flag 
that makes the area expand in response to stack growth.

Besides, userspace could always play funky trampoline games with the 
stack pointer, or even dynamically expand the stack by doing a malloc 
if a stack overflow draws near, which would put the stack in the data 
section temporarily.

As long as esp is in the bounds of a valid VMA, my vote is that we 
should consider it undefined how the task uses it.

On Mon, Oct 3, 2016 at 4:17 PM, Linus Torvalds 
<torvalds@linux-foundation.org> wrote:
> On Mon, Oct 3, 2016 at 4:08 PM, Andy Lutomirski <luto@amacapital.net> 
> wrote:
>> 
>>  Ping!
>> 
>>  We need to decide fairly soon whether to apply these (or perhaps 
>> just
>>  patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
>>  applied, I'll send quick fixups to pin the stack in the offending
>>  code.
> 
> I think we should apply it. Hopefully nothing uses it, and nobody will
> notice. And if somebody *does* notice, the sooner we find out, the
> better.
> 
>              Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-04  7:06       ` Raymond Jennings
  0 siblings, 0 replies; 38+ messages in thread
From: Raymond Jennings @ 2016-10-04  7:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Kees Cook, Tycho Andersen

My personal opinion is that even looking at esp/rsp is asking for 
trouble.  The only reliable information is VM_STACK or another VM flag 
that makes the area expand in response to stack growth.

Besides, userspace could always play funky trampoline games with the 
stack pointer, or even dynamically expand the stack by doing a malloc 
if a stack overflow draws near, which would put the stack in the data 
section temporarily.

As long as esp is in the bounds of a valid VMA, my vote is that we 
should consider it undefined how the task uses it.

On Mon, Oct 3, 2016 at 4:17 PM, Linus Torvalds 
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> On Mon, Oct 3, 2016 at 4:08 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> 
> wrote:
>> 
>>  Ping!
>> 
>>  We need to decide fairly soon whether to apply these (or perhaps 
>> just
>>  patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
>>  applied, I'll send quick fixups to pin the stack in the offending
>>  code.
> 
> I think we should apply it. Hopefully nothing uses it, and nobody will
> notice. And if somebody *does* notice, the sooner we find out, the
> better.
> 
>              Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-14 18:25           ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-14 18:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, Andy Lutomirski, X86 ML, linux-kernel,
	Brian Gerst, Borislav Petkov, Linux API, Linus Torvalds,
	Kees Cook, Tycho Andersen, Tetsuo Handa

On Sat, Oct 1, 2016 at 3:37 AM, Jann Horn <jann@thejh.net> wrote:
> On Fri, Sep 30, 2016 at 07:01:13PM -0700, Andy Lutomirski wrote:
>> On an unrelated note, can we please lock down all the silly historical
>> *userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
>> the very least, only argv[0] should be visible if the reader lacks
>> ptrace access).
>>
>> Less nasty ones include: limits, sched, autogroup, comm, wchan,
>> schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter
>
> If that doesn't break stuff, I'm very much in favor of it.
>
>
>> uid_map, gid_map, etc are just screwed up.  They should be per
>> *namespace* somewhere, and they should require creds on the namespace.
>
> What do you have in mind? Something like
> /proc/namespaces/user:123456/{uid_map,gid_map,setgroups,parent_ns},
> with jumped fake symlinks to the directory and its entries in /proc/$pid/?
>

Something along those lines would be nice.

There's an unfortunate tension between having names for namespaces
(like 123456 in your example) for ease of use and *not* having names
so CRIU can restore them more easily.

>
>> timerslack is totally fscked up -- it allows ugo to write and it
>> checks the wrong creds.  Jann, does your series fix that?
>
> Nope. Never noticed that thing so far, probably because it was only
> added a few months ago. :/ Will add it to my series.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
@ 2016-10-14 18:25           ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-14 18:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, Andy Lutomirski, X86 ML,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Linux API, Linus Torvalds, Kees Cook,
	Tycho Andersen, Tetsuo Handa

On Sat, Oct 1, 2016 at 3:37 AM, Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org> wrote:
> On Fri, Sep 30, 2016 at 07:01:13PM -0700, Andy Lutomirski wrote:
>> On an unrelated note, can we please lock down all the silly historical
>> *userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
>> the very least, only argv[0] should be visible if the reader lacks
>> ptrace access).
>>
>> Less nasty ones include: limits, sched, autogroup, comm, wchan,
>> schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter
>
> If that doesn't break stuff, I'm very much in favor of it.
>
>
>> uid_map, gid_map, etc are just screwed up.  They should be per
>> *namespace* somewhere, and they should require creds on the namespace.
>
> What do you have in mind? Something like
> /proc/namespaces/user:123456/{uid_map,gid_map,setgroups,parent_ns},
> with jumped fake symlinks to the directory and its entries in /proc/$pid/?
>

Something along those lines would be nice.

There's an unfortunate tension between having names for namespaces
(like 123456 in your example) for ease of use and *not* having names
so CRIU can restore them more easily.

>
>> timerslack is totally fscked up -- it allows ugo to write and it
>> checks the wrong creds.  Jann, does your series fix that?
>
> Nope. Never noticed that thing so far, probably because it was only
> added a few months ago. :/ Will add it to my series.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-14 18:26       ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-14 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Kees Cook, Tycho Andersen

On Mon, Oct 3, 2016 at 4:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Oct 3, 2016 at 4:08 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Ping!
>>
>> We need to decide fairly soon whether to apply these (or perhaps just
>> patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
>> applied, I'll send quick fixups to pin the stack in the offending
>> code.
>
> I think we should apply it. Hopefully nothing uses it, and nobody will
> notice. And if somebody *does* notice, the sooner we find out, the
> better.
>

Ingo?  If we're going to make this change, I think it would be nice to
do it before -rc1.  If not, I want to get the alternate fix in ASAP.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads
@ 2016-10-14 18:26       ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-10-14 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Brian Gerst, Borislav Petkov, Jann Horn, Linux API, Kees Cook,
	Tycho Andersen

On Mon, Oct 3, 2016 at 4:17 PM, Linus Torvalds
<torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> On Mon, Oct 3, 2016 at 4:08 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>
>> Ping!
>>
>> We need to decide fairly soon whether to apply these (or perhaps just
>> patch 1 or just patches 2 and 3) for 4.9.  For any parts that aren't
>> applied, I'll send quick fixups to pin the stack in the offending
>> code.
>
> I think we should apply it. Hopefully nothing uses it, and nobody will
> notice. And if somebody *does* notice, the sooner we find out, the
> better.
>

Ingo?  If we're going to make this change, I think it would be nice to
do it before -rc1.  If not, I want to get the alternate fix in ASAP.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat
  2016-10-14 18:25           ` Andy Lutomirski
  (?)
@ 2016-10-14 20:01           ` Tycho Andersen
  -1 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2016-10-14 20:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Peter Zijlstra, Andy Lutomirski, X86 ML, linux-kernel,
	Brian Gerst, Borislav Petkov, Linux API, Linus Torvalds,
	Kees Cook, Tetsuo Handa

On Fri, Oct 14, 2016 at 11:25:58AM -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 3:37 AM, Jann Horn <jann@thejh.net> wrote:
> > On Fri, Sep 30, 2016 at 07:01:13PM -0700, Andy Lutomirski wrote:
> >> On an unrelated note, can we please lock down all the silly historical
> >> *userspace* info leaks in /proc?  Nasty ones include: net, cmdline (at
> >> the very least, only argv[0] should be visible if the reader lacks
> >> ptrace access).
> >>
> >> Less nasty ones include: limits, sched, autogroup, comm, wchan,
> >> schedstat, cpuset, cgroup, oom_*, sessionid, coredump_filter
> >
> > If that doesn't break stuff, I'm very much in favor of it.
> >
> >
> >> uid_map, gid_map, etc are just screwed up.  They should be per
> >> *namespace* somewhere, and they should require creds on the namespace.
> >
> > What do you have in mind? Something like
> > /proc/namespaces/user:123456/{uid_map,gid_map,setgroups,parent_ns},
> > with jumped fake symlinks to the directory and its entries in /proc/$pid/?
> >
> 
> Something along those lines would be nice.
> 
> There's an unfortunate tension between having names for namespaces
> (like 123456 in your example) for ease of use and *not* having names
> so CRIU can restore them more easily.

As long as the namespaces aren't visible to the container itself, it
has been okay. For example, LXC (and IIUC OpenVZ as well) put
containers in a /lxc/<name> cgroup, which with cgroup namespaces isn't
visible to the container. CRIU implements support for rewriting this
prefix, so we can do renaming on the fly. Since most container engines
won't allow two containers with the same name, we avoid clashes at a
higher level before we end up with an actual "namespace ID" (cgroup
path) clash. We're doing a similar thing for AppArmor namespaces.

Of course, if the IDs are visible to the container like they would be
in this case, then it is much harder.

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [tip:mm/urgent] fs/proc: Stop reporting eip and esp in /proc/PID/stat
  2016-09-30 17:58   ` Andy Lutomirski
  (?)
  (?)
@ 2016-10-20 11:13   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-10-20 11:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, penguin-kernel, hpa, linux-api, linux-kernel, akpm,
	tglx, luto, keescook, mingo, bp, peterz, viro, torvalds,
	tycho.andersen, jann

Commit-ID:  0a1eb2d474edfe75466be6b4677ad84e5e8ca3f5
Gitweb:     http://git.kernel.org/tip/0a1eb2d474edfe75466be6b4677ad84e5e8ca3f5
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Fri, 30 Sep 2016 10:58:56 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 20 Oct 2016 09:21:41 +0200

fs/proc: Stop reporting eip and esp in /proc/PID/stat

Reporting these fields on a non-current task is dangerous.  If the
task is in any state other than normal kernel code, they may contain
garbage or even kernel addresses on some architectures.  (x86_64
used to do this.  I bet lots of architectures still do.)  With
CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too.

As far as I know, there are no use programs that make any material
use of these fields, so just get rid of them.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 fs/proc/array.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 89600fd..81818ad 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -412,10 +412,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	mm = get_task_mm(task);
 	if (mm) {
 		vsize = task_vsize(mm);
-		if (permitted) {
-			eip = KSTK_EIP(task);
-			esp = KSTK_ESP(task);
-		}
+		/*
+		 * esp and eip are intentionally zeroed out.  There is no
+		 * non-racy way to read them without freezing the task.
+		 * Programs that need reliable values can use ptrace(2).
+		 */
 	}
 
 	get_task_comm(tcomm, task);

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:mm/urgent] fs/proc: Stop trying to report thread stacks
  2016-09-30 17:58 ` [PATCH 2/3] proc: Stop trying to report thread stacks Andy Lutomirski
@ 2016-10-20 11:13   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-10-20 11:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, bp, luto, peterz, linux-api, akpm, linux-kernel, viro,
	hannes, hpa, brgerst, jann, keescook, torvalds, tycho.andersen,
	tglx

Commit-ID:  b18cb64ead400c01bf1580eeba330ace51f8087d
Gitweb:     http://git.kernel.org/tip/b18cb64ead400c01bf1580eeba330ace51f8087d
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Fri, 30 Sep 2016 10:58:57 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 20 Oct 2016 09:21:41 +0200

fs/proc: Stop trying to report thread stacks

This reverts more of:

  b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps")

... which was partially reverted by:

  65376df58217 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")

Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.

In current kernels, /proc/PID/maps (or /proc/TID/maps even for
threads) shows "[stack]" for VMAs in the mm's stack address range.

In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
target thread's stack's VMA.  This is racy, probably returns garbage
and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone:
KSTK_ESP is not safe to use on tasks that aren't known to be running
ordinary process-context kernel code.

This patch removes the difference and just shows "[stack]" for VMAs
in the mm's stack range.  This is IMO much more sensible -- the
actual "stack" address really is treated specially by the VM code,
and the current thread stack isn't even well-defined for programs
that frequently switch stacks on their own.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/filesystems/proc.txt | 26 --------------------------
 fs/proc/task_mmu.c                 | 29 ++++++++++-------------------
 fs/proc/task_nommu.c               | 28 ++++++++++------------------
 3 files changed, 20 insertions(+), 63 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 219ffd4..74329fd 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -395,32 +395,6 @@ is not associated with a file:
 
  or if empty, the mapping is anonymous.
 
-The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint
-of the individual tasks of a process. In this file you will see a mapping marked
-as [stack] if that task sees it as a stack. Hence, for the example above, the
-task-level map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this:
-
-08048000-08049000 r-xp 00000000 03:00 8312       /opt/test
-08049000-0804a000 rw-p 00001000 03:00 8312       /opt/test
-0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
-a7cb1000-a7cb2000 ---p 00000000 00:00 0
-a7cb2000-a7eb2000 rw-p 00000000 00:00 0
-a7eb2000-a7eb3000 ---p 00000000 00:00 0
-a7eb3000-a7ed5000 rw-p 00000000 00:00 0          [stack]
-a7ed5000-a8008000 r-xp 00000000 03:00 4222       /lib/libc.so.6
-a8008000-a800a000 r--p 00133000 03:00 4222       /lib/libc.so.6
-a800a000-a800b000 rw-p 00135000 03:00 4222       /lib/libc.so.6
-a800b000-a800e000 rw-p 00000000 00:00 0
-a800e000-a8022000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
-a8022000-a8023000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
-a8023000-a8024000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
-a8024000-a8027000 rw-p 00000000 00:00 0
-a8027000-a8043000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
-a8043000-a8044000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
-a8044000-a8045000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
-aff35000-aff4a000 rw-p 00000000 00:00 0
-ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
-
 The /proc/PID/smaps is an extension based on maps, showing the memory
 consumption for each of the process's mappings. For each of mappings there
 is a series of lines such as the following:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6909582..35b92d8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -266,24 +266,15 @@ static int do_maps_open(struct inode *inode, struct file *file,
  * /proc/PID/maps that is the stack of the main task.
  */
 static int is_stack(struct proc_maps_private *priv,
-		    struct vm_area_struct *vma, int is_pid)
+		    struct vm_area_struct *vma)
 {
-	int stack = 0;
-
-	if (is_pid) {
-		stack = vma->vm_start <= vma->vm_mm->start_stack &&
-			vma->vm_end >= vma->vm_mm->start_stack;
-	} else {
-		struct inode *inode = priv->inode;
-		struct task_struct *task;
-
-		rcu_read_lock();
-		task = pid_task(proc_pid(inode), PIDTYPE_PID);
-		if (task)
-			stack = vma_is_stack_for_task(vma, task);
-		rcu_read_unlock();
-	}
-	return stack;
+	/*
+	 * We make no effort to guess what a given thread considers to be
+	 * its "stack".  It's not even well-defined for programs written
+	 * languages like Go.
+	 */
+	return vma->vm_start <= vma->vm_mm->start_stack &&
+		vma->vm_end >= vma->vm_mm->start_stack;
 }
 
 static void
@@ -354,7 +345,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 			goto done;
 		}
 
-		if (is_stack(priv, vma, is_pid))
+		if (is_stack(priv, vma))
 			name = "[stack]";
 	}
 
@@ -1669,7 +1660,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 		seq_file_path(m, file, "\n\t= ");
 	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
 		seq_puts(m, " heap");
-	} else if (is_stack(proc_priv, vma, is_pid)) {
+	} else if (is_stack(proc_priv, vma)) {
 		seq_puts(m, " stack");
 	}
 
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index faacb0c..3717562 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -124,25 +124,17 @@ unsigned long task_statm(struct mm_struct *mm,
 }
 
 static int is_stack(struct proc_maps_private *priv,
-		    struct vm_area_struct *vma, int is_pid)
+		    struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int stack = 0;
-
-	if (is_pid) {
-		stack = vma->vm_start <= mm->start_stack &&
-			vma->vm_end >= mm->start_stack;
-	} else {
-		struct inode *inode = priv->inode;
-		struct task_struct *task;
-
-		rcu_read_lock();
-		task = pid_task(proc_pid(inode), PIDTYPE_PID);
-		if (task)
-			stack = vma_is_stack_for_task(vma, task);
-		rcu_read_unlock();
-	}
-	return stack;
+
+	/*
+	 * We make no effort to guess what a given thread considers to be
+	 * its "stack".  It's not even well-defined for programs written
+	 * languages like Go.
+	 */
+	return vma->vm_start <= mm->start_stack &&
+		vma->vm_end >= mm->start_stack;
 }
 
 /*
@@ -184,7 +176,7 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma,
 	if (file) {
 		seq_pad(m, ' ');
 		seq_file_path(m, file, "");
-	} else if (mm && is_stack(priv, vma, is_pid)) {
+	} else if (mm && is_stack(priv, vma)) {
 		seq_pad(m, ' ');
 		seq_printf(m, "[stack]");
 	}

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:mm/urgent] mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
  2016-09-30 17:58   ` Andy Lutomirski
  (?)
@ 2016-10-20 11:14   ` tip-bot for Andy Lutomirski
  -1 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-10-20 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, viro, tglx, bp, torvalds, hpa, tycho.andersen, brgerst,
	keescook, peterz, linux-kernel, jann, linux-api, akpm, mingo

Commit-ID:  d17af5056cf9e9fc05e68832f7c15687fcc12281
Gitweb:     http://git.kernel.org/tip/d17af5056cf9e9fc05e68832f7c15687fcc12281
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Fri, 30 Sep 2016 10:58:58 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 20 Oct 2016 09:21:41 +0200

mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()

Asking for a non-current task's stack can't be done without races
unless the task is frozen in kernel mode.  As far as I know,
vm_is_stack_for_task() never had a safe non-current use case.

The __unused annotation is because some KSTK_ESP implementations
ignore their parameter, which IMO is further justification for this
patch.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h       | 2 +-
 mm/util.c                | 4 +++-
 security/selinux/hooks.c | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9caec6..a658a51 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1391,7 +1391,7 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma,
 		!vma_growsup(vma->vm_next, addr);
 }
 
-int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t);
+int vma_is_stack_for_current(struct vm_area_struct *vma);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/util.c b/mm/util.c
index 662cddf..c174e89 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -230,8 +230,10 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /* Check if the vma is being used as a stack by this task */
-int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t)
+int vma_is_stack_for_current(struct vm_area_struct *vma)
 {
+	struct task_struct * __maybe_unused t = current;
+
 	return (vma->vm_start <= KSTK_ESP(t) && vma->vm_end >= KSTK_ESP(t));
 }
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 0850579..09fd610 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3557,7 +3557,7 @@ static int selinux_file_mprotect(struct vm_area_struct *vma,
 		} else if (!vma->vm_file &&
 			   ((vma->vm_start <= vma->vm_mm->start_stack &&
 			     vma->vm_end >= vma->vm_mm->start_stack) ||
-			    vma_is_stack_for_task(vma, current))) {
+			    vma_is_stack_for_current(vma))) {
 			rc = current_has_perm(current, PROCESS__EXECSTACK);
 		} else if (vma->vm_file && vma->anon_vma) {
 			/*

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
  2016-09-30 17:58   ` Andy Lutomirski
                     ` (2 preceding siblings ...)
  (?)
@ 2016-11-01 14:36   ` Tetsuo Handa
  2016-11-01 23:47     ` Linus Torvalds
  -1 siblings, 1 reply; 38+ messages in thread
From: Tetsuo Handa @ 2016-11-01 14:36 UTC (permalink / raw)
  To: luto, x86
  Cc: linux-kernel, brgerst, bp, jann, linux-api, torvalds, keescook,
	tycho.andersen

Hello.

Andy Lutomirski wrote:
> Reporting these fields on a non-current task is dangerous.  If the
> task is in any state other than normal kernel code, they may contain
> garbage or even kernel addresses on some architectures.  (x86_64
> used to do this.  I bet lots of architectures still do.)  With
> CONFIG_THREAD_INFO_IN_TASK, it can OOPS, too.
> 
> As far as I know, there are no use programs that make any material
> use of these fields, so just get rid of them.
> 
> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: Tycho Andersen <tycho.andersen@canonical.com>
> Cc: Kees Cook <keescook@chromium.org>
> Reported-by: Jann Horn <jann@thejh.net>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  fs/proc/array.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 88c7de12197b..1bb1097e73b7 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -417,10 +417,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
>  	mm = get_task_mm(task);
>  	if (mm) {
>  		vsize = task_vsize(mm);
> -		if (permitted) {
> -			eip = KSTK_EIP(task);
> -			esp = KSTK_ESP(task);
> -		}
> +		/*
> +		 * esp and eip are intentionally zeroed out.  There is no
> +		 * non-racy way to read them without freezing the task.
> +		 * Programs that need reliable values can use ptrace(2).
> +		 */
>  	}
>  
>  	get_task_comm(tcomm, task);
> -- 
> 2.7.4

I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y .
Is this same reason?

[  580.778495] Out of memory: Kill process 10206 (a.out) score 998 or sacrifice child
[  580.778499] Killed process 10206 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  580.797408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  580.802963] a.out           x[  580.803660] BUG: unable to handle kernel 
paging request at ffffc900144dfc60
[  580.807153] IP: [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[  580.809313] PGD 7f4c0067 [  580.809875] PUD 7f4c1067 
PMD 47df1067 [  580.811690] PTE 0
[  580.812998] 
[  580.814155] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  580.816139] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc[  580.821830] oom_reaper: reaped process 10206 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  580.822492] Out of memory: Kill process 10208 (a.out) score 998 or sacrifice child
[  580.822496] Killed process 10208 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  580.824895] oom_reaper: reaped process 10208 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  580.833682]  ebtable_filter ebtables[  580.834453] Out of memory: Kill process 10210 (a.out) score 998 or sacrifice child
[  580.834458] Killed process 10210 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  580.839762]  ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter coretemp pcspkr sg i2c_piix4 vmw_vmci shpchp ip_tables sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci e1000 mptspi libahci drm scsi_transport_spi mptscsih mptbase i2c_core ata_piix libata
[  580.850620] CPU: 2 PID: 45 Comm: khungtaskd Tainted: G        W       4.9.0-rc3+ #83
[  580.853526] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  580.856842] task: ffff88007b54b7c0 task.stack: ffffc900004c0000
[  580.859169] RIP: 0010:[<ffffffff81026feb>]  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[  580.862264] RSP: 0018:ffffc900004c3db8  EFLAGS: 00010202
[  580.864343] RAX: ffffc900144dfc30 RBX: ffff8800438e1c00 RCX: 0000000000000000
[  580.867439] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800438e1c00
[  580.869910] RBP: ffffc900004c3db8 R08: 0000000000000001 R09: 0000000000000001
[  580.872963] R10: 0000000000000000 R11: 0000000000aaaaaa R12: 0000000000000007
[  580.875522] R13: 000000000000028a R14: 00000000003ffa8a R15: ffff8800438e1eb8
[  580.877387] oom_reaper: reaped process 10210 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  580.878738] Out of memory: Kill process 10212 (a.out) score 998 or sacrifice child
[  580.878743] Killed process 10212 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  580.887239] FS:  0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
[  580.890017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  580.892628] CR2: ffffc900144dfc60 CR3: 0000000001c0c000 CR4: 00000000001406e0
[  580.895101] Stack:
[  580.896443]  ffffc900004c3de0 ffffffff810974c0 0000000000000000 ffff8800438e1c00
[  580.899033]  ffff8800438e1c00 ffffc900004c3e40 ffffffff8112a500 ffffffff8112a32d
[  580.904306]  000000000000003c ffff8800438e1c00 0000000000000003 000000010003e000
[  580.907040] Call Trace:
[  580.908547]  [<ffffffff810974c0>] sched_show_task+0x50/0x240
[  580.911435] oom_reaper: reaped process 10212 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  580.912449] Out of memory: Kill process 10214 (a.out) score 998 or sacrifice child
[  580.912453] Killed process 10214 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  580.919432] oom_reaper: reaped process 10214 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  580.920256] Out of memory: Kill process 10216 (a.out) score 998 or sacrifice child
[  580.920259] Killed process 10216 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  580.928793]  [<ffffffff8112a500>] watchdog+0x3d0/0x4f0
[  580.930774]  [<ffffffff8112a32d>] ? watchdog+0x1fd/0x4f0
[  580.932785]  [<ffffffff8112a130>] ? check_memalloc_stalling_tasks+0x820/0x820
[  580.935649]  [<ffffffff81089b4d>] kthread+0xfd/0x120
[  580.937594]  [<ffffffff81089a50>] ? kthread_park+0x60/0x60
[  580.939693]  [<ffffffff81089a50>] ? kthread_park+0x60/0x60
[  580.941743]  [<ffffffff816a4c57>] ret_from_fork+0x27/0x40
[  580.944608] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 
[  580.952519] RIP  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[  580.954654]  RSP <ffffc900004c3db8>
[  580.956272] CR2: ffffc900144dfc60
[  580.957861] ---[ end trace cd024114d281cfa4 ]---
[  580.959662] BUG: sleeping function called from invalid context at ./include/linux/sched.h:3138
[  580.962350] in_atomic(): 0, irqs_disabled(): 1, pid: 45, name: khungtaskd
[  580.964610] INFO: lockdep is turned off.
[  580.966236] irq event stamp: 88
[  580.967682] hardirqs last  enabled at (87): [  580.968588] [<ffffffff816a4075>] _raw_spin_unlock_irqrestore+0x55/0x70
[  580.970766] hardirqs last disabled at (88): [  580.971654] [<ffffffff8169ddb1>] __schedule+0x91/0x730
[  580.973574] softirqs last  enabled at (66): [  580.974607] [<ffffffff8106d422>] __do_softirq+0x192/0x220
[  580.976628] softirqs last disabled at (59): [  580.977528] [<ffffffff8106d754>] irq_exit+0xc4/0x100
[  580.979345] Preemption disabled at:[  580.980073] [<ffffffff810d1a7f>] wake_up_klogd+0xf/0x70
[  580.981951] CPU: 2 PID: 45 Comm: khungtaskd Tainted: G      D W       4.9.0-rc3+ #83
[  580.984297] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  580.987279]  ffffc900004c3e50 ffffffff813372bf 0000000000000000 ffff88007b54b7c0
[  580.989759]  ffffc900004c3e88 ffffffff8108fa2c ffffffff819799f2 0000000000000c42
[  580.992259]  0000000000000000 ffff88007b54b7c0 0000000000000000 ffffc900004c3eb0
[  580.994701] Call Trace:
[  580.995988]  [<ffffffff813372bf>] dump_stack+0x67/0x98
[  580.997835]  [<ffffffff8108fa2c>] ___might_sleep+0x16c/0x260
[  581.000291]  [<ffffffff8108fb65>] __might_sleep+0x45/0x80
[  581.002552]  [<ffffffff8107823e>] exit_signals+0x2e/0x2f0
[  581.004411]  [<ffffffff8108b991>] ? blocking_notifier_call_chain+0x11/0x20
[  581.006760]  [<ffffffff8106bbe6>] do_exit+0xb6/0xb10
[  581.008646]  [<ffffffff816a6627>] rewind_stack_do_exit+0x17/0x20
[  608.732005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [vmtoolsd:2075]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
  2016-11-01 14:36   ` [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60 Tetsuo Handa
@ 2016-11-01 23:47     ` Linus Torvalds
  2016-11-02 10:50       ` Tetsuo Handa
  2016-11-03  7:10       ` [tip:sched/urgent] sched/core: Remove pointless printout " tip-bot for Linus Torvalds
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2016-11-01 23:47 UTC (permalink / raw)
  To: Tetsuo Handa, Peter Zijlstra, Ingo Molnar
  Cc: Andrew Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Brian Gerst, Borislav Petkov,
	Jann Horn, Linux API, Kees Cook, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 2344 bytes --]

On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y .
> Is this same reason?

CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
did VMAP_STACK

And yes, it looks like it's the same "touching another process' stack"
issue, just in sched_show_task() called from check_hung_task(), which
seems to have been due to a watchdog triggering. I'm not sure what the
relationship is with the oom killer happening at the same time, but it
makes the whole thing fairly hard to read.

The cleaned-up oops looks like this:

> [  580.803660] BUG: unable to handle kernel paging request at ffffc900144dfc60
> [  580.807153] IP:  thread_saved_pc+0xb/0x20
> [  580.907040] Call Trace:
> [  580.908547]   sched_show_task+0x50/0x240
> [  580.928793]   watchdog+0x3d0/0x4f0
> [  580.930774]   ? watchdog+0x1fd/0x4f0
> [  580.932785]   ? check_memalloc_stalling_tasks+0x820/0x820
> [  580.935649]   kthread+0xfd/0x120
> [  580.937594]   ? kthread_park+0x60/0x60
> [  580.939693]   ? kthread_park+0x60/0x60
> [  580.941743]   ret_from_fork+0x27/0x40
> [  580.944608] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
> [  580.952519] RIP  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  580.954654]  RSP <ffffc900004c3db8>
> [  580.956272] CR2: ffffc900144dfc60

So we have

  watchdog -> check_hung_uninterruptible_task -> check_hung_task ->
sched_show_task -> thread_saved_pc(), which oopses.

We just checked that task was TASK_UNINTERRUPTIBLE in that chain, but
clearly it races with it dying (due to oom), and so by the time er get
to thread_saved_pc() it's dead and the stack is gone.

Considering that we just print out  a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all.  The real information
would have come later as part of "show_stack()", which seems to be
doing the proper  try_get_task_stack().

So I _think_ the fix is to just remove this. Perhaps something like
the attached? Adding scheduler people since this is in their code..

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 776 bytes --]

 kernel/sched/core.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027f9e26..3c3022466331 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5196,17 +5196,8 @@ void sched_show_task(struct task_struct *p)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
 		state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
-#if BITS_PER_LONG == 32
-	if (state == TASK_RUNNING)
-		printk(KERN_CONT " running  ");
-	else
-		printk(KERN_CONT " %08lx ", thread_saved_pc(p));
-#else
 	if (state == TASK_RUNNING)
 		printk(KERN_CONT "  running task    ");
-	else
-		printk(KERN_CONT " %016lx ", thread_saved_pc(p));
-#endif
 #ifdef CONFIG_DEBUG_STACK_USAGE
 	free = stack_not_used(p);
 #endif

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
  2016-11-01 23:47     ` Linus Torvalds
@ 2016-11-02 10:50       ` Tetsuo Handa
  2016-11-02 14:05           ` Andy Lutomirski
                           ` (2 more replies)
  2016-11-03  7:10       ` [tip:sched/urgent] sched/core: Remove pointless printout " tip-bot for Linus Torvalds
  1 sibling, 3 replies; 38+ messages in thread
From: Tetsuo Handa @ 2016-11-02 10:50 UTC (permalink / raw)
  To: torvalds, peterz, mingo
  Cc: luto, x86, linux-kernel, brgerst, bp, jann, linux-api, keescook,
	tycho.andersen

Linus Torvalds wrote:
> On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y .
> > Is this same reason?
> 
> CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
> did VMAP_STACK

Yes. And I wrote a reproducer.

---------- Reproducer start ----------
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	if (fork() == 0)
		_exit(0);
	sleep(1);
	system("echo t > /proc/sysrq-trigger");
	return 0;
}
---------- Reproducer end ----------

---------- Serial console log start ----------
[  328.528734] a.out           x
[  328.529293] BUG: unable to handle kernel
[  328.530655] paging request at ffffc90001f43e18
[  328.531837] IP: [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[  328.533512] PGD 7f4c0067
[  328.533972] PUD 7f4c1067
[  328.535065] PMD 74cba067
[  328.535296] PTE 0

[  328.537173] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  328.538698] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter coretemp pcspkr sg i2c_piix4 shpchp vmw_vmci ip_tables sd_mod ata_generic pata_acpi serio_raw mptspi vmwgfx scsi_transport_spi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt mptscsih e1000 fb_sys_fops libahci ttm drm mptbase ata_piix i2c_core libata
[  328.552465] CPU: 0 PID: 4299 Comm: sh Tainted: G        W       4.9.0-rc3+ #83
[  328.554403] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  328.556939] task: ffff8800792b5380 task.stack: ffffc90001f58000
[  328.558686] RIP: 0010:[<ffffffff81026feb>]  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[  328.560926] RSP: 0018:ffffc90001f5bd28  EFLAGS: 00010202
[  328.562603] RAX: ffffc90001f43de8 RBX: ffff88007826d380 RCX: 0000000000000006
[  328.564507] RDX: 0000000000000000 RSI: ffffffff8197f2d1 RDI: ffff88007826d380
[  328.566437] RBP: ffffc90001f5bd28 R08: 0000000000000001 R09: 0000000000000001
[  328.568354] R10: 0000000000000001 R11: 0000000000000004 R12: 0000000000000007
[  328.570266] R13: ffff88007826d638 R14: ffff88007826d380 R15: 0000000000000002
[  328.572197] FS:  00007ff7b501e740(0000) GS:ffff88007c200000(0000) knlGS:0000000000000000
[  328.574303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  328.576006] CR2: ffffc90001f43e18 CR3: 000000007894c000 CR4: 00000000001406f0
[  328.577995] Stack:
[  328.579024]  ffffc90001f5bd50 ffffffff810974c0 ffffc90001f5bd50 ffff88007826d380
[  328.581219]  0000000000000000 ffffc90001f5bd88 ffffffff81097767 ffffffff810976b0
[  328.583300]  ffffffff81c74e60 0000000000000074 0000000000000000 0000000000000007
[  328.585404] Call Trace:
[  328.586531]  [<ffffffff810974c0>] sched_show_task+0x50/0x240
[  328.588184]  [<ffffffff81097767>] show_state_filter+0xb7/0x190
[  328.589860]  [<ffffffff810976b0>] ? sched_show_task+0x240/0x240
[  328.591553]  [<ffffffff813fd4fb>] sysrq_handle_showstate+0xb/0x20
[  328.593304]  [<ffffffff813fdce6>] __handle_sysrq+0x136/0x220
[  328.594992]  [<ffffffff813fdbb0>] ? __sysrq_get_key_op+0x30/0x30
[  328.596678]  [<ffffffff813fe1f1>] write_sysrq_trigger+0x41/0x50
[  328.598386]  [<ffffffff81249c88>] proc_reg_write+0x38/0x70
[  328.600038]  [<ffffffff811dc802>] __vfs_write+0x32/0x140
[  328.601604]  [<ffffffff810dc797>] ? rcu_read_lock_sched_held+0x87/0x90
[  328.603365]  [<ffffffff810dcb2a>] ? rcu_sync_lockdep_assert+0x2a/0x50
[  328.605111]  [<ffffffff811e0279>] ? __sb_start_write+0x189/0x240
[  328.606735]  [<ffffffff811dd642>] ? vfs_write+0x182/0x1b0
[  328.608278]  [<ffffffff811dd570>] vfs_write+0xb0/0x1b0
[  328.609777]  [<ffffffff81002240>] ? syscall_trace_enter+0x1b0/0x240
[  328.611513]  [<ffffffff811dea13>] SyS_write+0x53/0xc0
[  328.612989]  [<ffffffff81353b63>] ? __this_cpu_preempt_check+0x13/0x20
[  328.614757]  [<ffffffff81002511>] do_syscall_64+0x61/0x1d0
[  328.616329]  [<ffffffff816a4aa4>] entry_SYSCALL64_slow_path+0x25/0x25
[  328.618057] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[  328.624402] RIP  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[  328.626124]  RSP <ffffc90001f5bd28>
[  328.627375] CR2: ffffc90001f43e18
[  328.628646] ---[ end trace 70b31f25a2ce0c0c ]---
---------- Serial console log end ----------

> Considering that we just print out  a useless hex number, not even a
> symbol, and there's a big question mark whether this even makes sense
> anyway, I suspect we should just remove it all.  The real information
> would have come later as part of "show_stack()", which seems to be
> doing the proper  try_get_task_stack().
> 
> So I _think_ the fix is to just remove this. Perhaps something like
> the attached? Adding scheduler people since this is in their code..

That is not sufficient, for another Oops occurs inside stack_not_used().
Since I don't want to break stack_not_used(), can we tolerate nested
try_get_task_stack() usage and protect the whole sched_show_task()?

----------------------------------------
>From 9cf83a0a8c48d281434b040694835743940a88b2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 2 Nov 2016 19:31:07 +0900
Subject: [PATCH] sched: Fix oops in sched_show_task()

When CONFIG_VMAP_STACK=y, it is possible that an exited thread remains in
the task list after its stack pointer was already set to NULL. Therefore,
thread_saved_pc() and stack_not_used() in sched_show_task() will trigger
NULL pointer dereference if an attempt to dump such thread's traces
(e.g. SysRq-t, khungtaskd) is made.

Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027..9abf66b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5192,6 +5192,8 @@ void sched_show_task(struct task_struct *p)
 	int ppid;
 	unsigned long state = p->state;
 
+	if (!try_get_task_stack(p))
+		return;
 	if (state)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
@@ -5221,6 +5223,7 @@ void sched_show_task(struct task_struct *p)
 
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
+	put_task_stack(p);
 }
 
 void show_state_filter(unsigned long state_filter)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
@ 2016-11-02 14:05           ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-11-02 14:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Andrew Lutomirski,
	X86 ML, linux-kernel, Brian Gerst, Borislav Petkov, Jann Horn,
	Linux API, Kees Cook, Tycho Andersen

On Wed, Nov 2, 2016 at 3:50 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Linus Torvalds wrote:
>> On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> >
>> > I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y .
>> > Is this same reason?
>>
>> CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
>> did VMAP_STACK
>
> Yes. And I wrote a reproducer.
>
> ---------- Reproducer start ----------
> #include <unistd.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>         if (fork() == 0)
>                 _exit(0);
>         sleep(1);
>         system("echo t > /proc/sysrq-trigger");
>         return 0;
> }
> ---------- Reproducer end ----------
>
> ---------- Serial console log start ----------
> [  328.528734] a.out           x
> [  328.529293] BUG: unable to handle kernel
> [  328.530655] paging request at ffffc90001f43e18
> [  328.531837] IP: [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  328.533512] PGD 7f4c0067
> [  328.533972] PUD 7f4c1067
> [  328.535065] PMD 74cba067
> [  328.535296] PTE 0
>
> [  328.537173] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  328.538698] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter coretemp pcspkr sg i2c_piix4 shpchp vmw_vmci ip_tables sd_mod ata_generic pata_acpi serio_raw mptspi vmwgfx scsi_transport_spi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt mptscsih e1000 fb_sys_fops libahci ttm drm mptbase ata_piix i2c_core libata
> [  328.552465] CPU: 0 PID: 4299 Comm: sh Tainted: G        W       4.9.0-rc3+ #83
> [  328.554403] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  328.556939] task: ffff8800792b5380 task.stack: ffffc90001f58000
> [  328.558686] RIP: 0010:[<ffffffff81026feb>]  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  328.560926] RSP: 0018:ffffc90001f5bd28  EFLAGS: 00010202
> [  328.562603] RAX: ffffc90001f43de8 RBX: ffff88007826d380 RCX: 0000000000000006
> [  328.564507] RDX: 0000000000000000 RSI: ffffffff8197f2d1 RDI: ffff88007826d380
> [  328.566437] RBP: ffffc90001f5bd28 R08: 0000000000000001 R09: 0000000000000001
> [  328.568354] R10: 0000000000000001 R11: 0000000000000004 R12: 0000000000000007
> [  328.570266] R13: ffff88007826d638 R14: ffff88007826d380 R15: 0000000000000002
> [  328.572197] FS:  00007ff7b501e740(0000) GS:ffff88007c200000(0000) knlGS:0000000000000000
> [  328.574303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  328.576006] CR2: ffffc90001f43e18 CR3: 000000007894c000 CR4: 00000000001406f0
> [  328.577995] Stack:
> [  328.579024]  ffffc90001f5bd50 ffffffff810974c0 ffffc90001f5bd50 ffff88007826d380
> [  328.581219]  0000000000000000 ffffc90001f5bd88 ffffffff81097767 ffffffff810976b0
> [  328.583300]  ffffffff81c74e60 0000000000000074 0000000000000000 0000000000000007
> [  328.585404] Call Trace:
> [  328.586531]  [<ffffffff810974c0>] sched_show_task+0x50/0x240
> [  328.588184]  [<ffffffff81097767>] show_state_filter+0xb7/0x190
> [  328.589860]  [<ffffffff810976b0>] ? sched_show_task+0x240/0x240
> [  328.591553]  [<ffffffff813fd4fb>] sysrq_handle_showstate+0xb/0x20
> [  328.593304]  [<ffffffff813fdce6>] __handle_sysrq+0x136/0x220
> [  328.594992]  [<ffffffff813fdbb0>] ? __sysrq_get_key_op+0x30/0x30
> [  328.596678]  [<ffffffff813fe1f1>] write_sysrq_trigger+0x41/0x50
> [  328.598386]  [<ffffffff81249c88>] proc_reg_write+0x38/0x70
> [  328.600038]  [<ffffffff811dc802>] __vfs_write+0x32/0x140
> [  328.601604]  [<ffffffff810dc797>] ? rcu_read_lock_sched_held+0x87/0x90
> [  328.603365]  [<ffffffff810dcb2a>] ? rcu_sync_lockdep_assert+0x2a/0x50
> [  328.605111]  [<ffffffff811e0279>] ? __sb_start_write+0x189/0x240
> [  328.606735]  [<ffffffff811dd642>] ? vfs_write+0x182/0x1b0
> [  328.608278]  [<ffffffff811dd570>] vfs_write+0xb0/0x1b0
> [  328.609777]  [<ffffffff81002240>] ? syscall_trace_enter+0x1b0/0x240
> [  328.611513]  [<ffffffff811dea13>] SyS_write+0x53/0xc0
> [  328.612989]  [<ffffffff81353b63>] ? __this_cpu_preempt_check+0x13/0x20
> [  328.614757]  [<ffffffff81002511>] do_syscall_64+0x61/0x1d0
> [  328.616329]  [<ffffffff816a4aa4>] entry_SYSCALL64_slow_path+0x25/0x25
> [  328.618057] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
> [  328.624402] RIP  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  328.626124]  RSP <ffffc90001f5bd28>
> [  328.627375] CR2: ffffc90001f43e18
> [  328.628646] ---[ end trace 70b31f25a2ce0c0c ]---
> ---------- Serial console log end ----------
>
>> Considering that we just print out  a useless hex number, not even a
>> symbol, and there's a big question mark whether this even makes sense
>> anyway, I suspect we should just remove it all.  The real information
>> would have come later as part of "show_stack()", which seems to be
>> doing the proper  try_get_task_stack().
>>
>> So I _think_ the fix is to just remove this. Perhaps something like
>> the attached? Adding scheduler people since this is in their code..
>
> That is not sufficient, for another Oops occurs inside stack_not_used().
> Since I don't want to break stack_not_used(), can we tolerate nested
> try_get_task_stack() usage and protect the whole sched_show_task()?
>
> ----------------------------------------
> >From 9cf83a0a8c48d281434b040694835743940a88b2 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 2 Nov 2016 19:31:07 +0900
> Subject: [PATCH] sched: Fix oops in sched_show_task()
>
> When CONFIG_VMAP_STACK=y, it is possible that an exited thread remains in

Nit: It's CONFIG_THREAD_INFO_IN_TASK=y that does this.

This patch looks fine to me.  Linus, your patch also looks almost good
(I think the lines you deleted were spaced like that to preserve
output alignment, which may or may not matter), and maybe it would
make sense to apply both.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
@ 2016-11-02 14:05           ` Andy Lutomirski
  0 siblings, 0 replies; 38+ messages in thread
From: Andy Lutomirski @ 2016-11-02 14:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Andrew Lutomirski,
	X86 ML, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Kees Cook, Tycho Andersen

On Wed, Nov 2, 2016 at 3:50 AM, Tetsuo Handa
<penguin-kernel-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> wrote:
> Linus Torvalds wrote:
>> On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
>> <penguin-kernel-1yMVhJb1mP/7nzcFbJAaVXf5DAMn2ifp@public.gmane.org> wrote:
>> >
>> > I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y .
>> > Is this same reason?
>>
>> CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
>> did VMAP_STACK
>
> Yes. And I wrote a reproducer.
>
> ---------- Reproducer start ----------
> #include <unistd.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>         if (fork() == 0)
>                 _exit(0);
>         sleep(1);
>         system("echo t > /proc/sysrq-trigger");
>         return 0;
> }
> ---------- Reproducer end ----------
>
> ---------- Serial console log start ----------
> [  328.528734] a.out           x
> [  328.529293] BUG: unable to handle kernel
> [  328.530655] paging request at ffffc90001f43e18
> [  328.531837] IP: [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  328.533512] PGD 7f4c0067
> [  328.533972] PUD 7f4c1067
> [  328.535065] PMD 74cba067
> [  328.535296] PTE 0
>
> [  328.537173] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  328.538698] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter coretemp pcspkr sg i2c_piix4 shpchp vmw_vmci ip_tables sd_mod ata_generic pata_acpi serio_raw mptspi vmwgfx scsi_transport_spi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt mptscsih e1000 fb_sys_fops libahci ttm drm mptbase ata_piix i2c_core libata
> [  328.552465] CPU: 0 PID: 4299 Comm: sh Tainted: G        W       4.9.0-rc3+ #83
> [  328.554403] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  328.556939] task: ffff8800792b5380 task.stack: ffffc90001f58000
> [  328.558686] RIP: 0010:[<ffffffff81026feb>]  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  328.560926] RSP: 0018:ffffc90001f5bd28  EFLAGS: 00010202
> [  328.562603] RAX: ffffc90001f43de8 RBX: ffff88007826d380 RCX: 0000000000000006
> [  328.564507] RDX: 0000000000000000 RSI: ffffffff8197f2d1 RDI: ffff88007826d380
> [  328.566437] RBP: ffffc90001f5bd28 R08: 0000000000000001 R09: 0000000000000001
> [  328.568354] R10: 0000000000000001 R11: 0000000000000004 R12: 0000000000000007
> [  328.570266] R13: ffff88007826d638 R14: ffff88007826d380 R15: 0000000000000002
> [  328.572197] FS:  00007ff7b501e740(0000) GS:ffff88007c200000(0000) knlGS:0000000000000000
> [  328.574303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  328.576006] CR2: ffffc90001f43e18 CR3: 000000007894c000 CR4: 00000000001406f0
> [  328.577995] Stack:
> [  328.579024]  ffffc90001f5bd50 ffffffff810974c0 ffffc90001f5bd50 ffff88007826d380
> [  328.581219]  0000000000000000 ffffc90001f5bd88 ffffffff81097767 ffffffff810976b0
> [  328.583300]  ffffffff81c74e60 0000000000000074 0000000000000000 0000000000000007
> [  328.585404] Call Trace:
> [  328.586531]  [<ffffffff810974c0>] sched_show_task+0x50/0x240
> [  328.588184]  [<ffffffff81097767>] show_state_filter+0xb7/0x190
> [  328.589860]  [<ffffffff810976b0>] ? sched_show_task+0x240/0x240
> [  328.591553]  [<ffffffff813fd4fb>] sysrq_handle_showstate+0xb/0x20
> [  328.593304]  [<ffffffff813fdce6>] __handle_sysrq+0x136/0x220
> [  328.594992]  [<ffffffff813fdbb0>] ? __sysrq_get_key_op+0x30/0x30
> [  328.596678]  [<ffffffff813fe1f1>] write_sysrq_trigger+0x41/0x50
> [  328.598386]  [<ffffffff81249c88>] proc_reg_write+0x38/0x70
> [  328.600038]  [<ffffffff811dc802>] __vfs_write+0x32/0x140
> [  328.601604]  [<ffffffff810dc797>] ? rcu_read_lock_sched_held+0x87/0x90
> [  328.603365]  [<ffffffff810dcb2a>] ? rcu_sync_lockdep_assert+0x2a/0x50
> [  328.605111]  [<ffffffff811e0279>] ? __sb_start_write+0x189/0x240
> [  328.606735]  [<ffffffff811dd642>] ? vfs_write+0x182/0x1b0
> [  328.608278]  [<ffffffff811dd570>] vfs_write+0xb0/0x1b0
> [  328.609777]  [<ffffffff81002240>] ? syscall_trace_enter+0x1b0/0x240
> [  328.611513]  [<ffffffff811dea13>] SyS_write+0x53/0xc0
> [  328.612989]  [<ffffffff81353b63>] ? __this_cpu_preempt_check+0x13/0x20
> [  328.614757]  [<ffffffff81002511>] do_syscall_64+0x61/0x1d0
> [  328.616329]  [<ffffffff816a4aa4>] entry_SYSCALL64_slow_path+0x25/0x25
> [  328.618057] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
> [  328.624402] RIP  [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [  328.626124]  RSP <ffffc90001f5bd28>
> [  328.627375] CR2: ffffc90001f43e18
> [  328.628646] ---[ end trace 70b31f25a2ce0c0c ]---
> ---------- Serial console log end ----------
>
>> Considering that we just print out  a useless hex number, not even a
>> symbol, and there's a big question mark whether this even makes sense
>> anyway, I suspect we should just remove it all.  The real information
>> would have come later as part of "show_stack()", which seems to be
>> doing the proper  try_get_task_stack().
>>
>> So I _think_ the fix is to just remove this. Perhaps something like
>> the attached? Adding scheduler people since this is in their code..
>
> That is not sufficient, for another Oops occurs inside stack_not_used().
> Since I don't want to break stack_not_used(), can we tolerate nested
> try_get_task_stack() usage and protect the whole sched_show_task()?
>
> ----------------------------------------
> >From 9cf83a0a8c48d281434b040694835743940a88b2 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>
> Date: Wed, 2 Nov 2016 19:31:07 +0900
> Subject: [PATCH] sched: Fix oops in sched_show_task()
>
> When CONFIG_VMAP_STACK=y, it is possible that an exited thread remains in

Nit: It's CONFIG_THREAD_INFO_IN_TASK=y that does this.

This patch looks fine to me.  Linus, your patch also looks almost good
(I think the lines you deleted were spaced like that to preserve
output alignment, which may or may not matter), and maybe it would
make sense to apply both.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
  2016-11-02 10:50       ` Tetsuo Handa
  2016-11-02 14:05           ` Andy Lutomirski
@ 2016-11-02 14:54         ` Linus Torvalds
  2016-11-03  6:32           ` Ingo Molnar
  2016-11-03  7:09         ` [tip:sched/urgent] sched/core: Fix oops in sched_show_task() tip-bot for Tetsuo Handa
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2016-11-02 14:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Kees Cook, Tycho Andersen

On Wed, Nov 2, 2016 at 4:50 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> So I _think_ the fix is to just remove this. Perhaps something like
>> the attached? Adding scheduler people since this is in their code..
>
> That is not sufficient, for another Oops occurs inside stack_not_used().
> Since I don't want to break stack_not_used(), can we tolerate nested
> try_get_task_stack() usage and protect the whole sched_show_task()?

Sure, looks good to me.

But as Andy says, we should probably apply my patch too, just to get
rid of the useless hex printk and the ugly ifdeffery.

PeterZ/Ingo?

                Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60
  2016-11-02 14:54         ` Linus Torvalds
@ 2016-11-03  6:32           ` Ingo Molnar
  0 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2016-11-03  6:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, Peter Zijlstra, Ingo Molnar, Andrew Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List, Brian Gerst,
	Borislav Petkov, Jann Horn, Linux API, Kees Cook, Tycho Andersen


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Nov 2, 2016 at 4:50 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >>
> >> So I _think_ the fix is to just remove this. Perhaps something like
> >> the attached? Adding scheduler people since this is in their code..
> >
> > That is not sufficient, for another Oops occurs inside stack_not_used().
> > Since I don't want to break stack_not_used(), can we tolerate nested
> > try_get_task_stack() usage and protect the whole sched_show_task()?
> 
> Sure, looks good to me.
> 
> But as Andy says, we should probably apply my patch too, just to get
> rid of the useless hex printk and the ugly ifdeffery.

Agreed, and I have applied both fixes to sched/urgent.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [tip:sched/urgent] sched/core: Fix oops in sched_show_task()
  2016-11-02 10:50       ` Tetsuo Handa
  2016-11-02 14:05           ` Andy Lutomirski
  2016-11-02 14:54         ` Linus Torvalds
@ 2016-11-03  7:09         ` tip-bot for Tetsuo Handa
  2 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Tetsuo Handa @ 2016-11-03  7:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, tglx, hpa, mingo, penguin-kernel, luto, peterz, torvalds

Commit-ID:  382005027fedc50b28d40ae64ef1461cca38953e
Gitweb:     http://git.kernel.org/tip/382005027fedc50b28d40ae64ef1461cca38953e
Author:     Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
AuthorDate: Wed, 2 Nov 2016 19:50:29 +0900
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 3 Nov 2016 07:27:34 +0100

sched/core: Fix oops in sched_show_task()

When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.

Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.

Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027..9abf66b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5192,6 +5192,8 @@ void sched_show_task(struct task_struct *p)
 	int ppid;
 	unsigned long state = p->state;
 
+	if (!try_get_task_stack(p))
+		return;
 	if (state)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
@@ -5221,6 +5223,7 @@ void sched_show_task(struct task_struct *p)
 
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
+	put_task_stack(p);
 }
 
 void show_state_filter(unsigned long state_filter)

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:sched/urgent] sched/core: Remove pointless printout in sched_show_task()
  2016-11-01 23:47     ` Linus Torvalds
  2016-11-02 10:50       ` Tetsuo Handa
@ 2016-11-03  7:10       ` tip-bot for Linus Torvalds
  1 sibling, 0 replies; 38+ messages in thread
From: tip-bot for Linus Torvalds @ 2016-11-03  7:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, tglx, hpa, torvalds, linux-kernel, peterz, mingo, penguin-kernel

Commit-ID:  8243d5597793b5e85143c9a935e1b971c59740a9
Gitweb:     http://git.kernel.org/tip/8243d5597793b5e85143c9a935e1b971c59740a9
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Tue, 1 Nov 2016 17:47:18 -0600
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 3 Nov 2016 07:31:34 +0100

sched/core: Remove pointless printout in sched_show_task()

In sched_show_task() we print out a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9abf66b..154fd68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5198,17 +5198,8 @@ void sched_show_task(struct task_struct *p)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
 		state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
-#if BITS_PER_LONG == 32
-	if (state == TASK_RUNNING)
-		printk(KERN_CONT " running  ");
-	else
-		printk(KERN_CONT " %08lx ", thread_saved_pc(p));
-#else
 	if (state == TASK_RUNNING)
 		printk(KERN_CONT "  running task    ");
-	else
-		printk(KERN_CONT " %016lx ", thread_saved_pc(p));
-#endif
 #ifdef CONFIG_DEBUG_STACK_USAGE
 	free = stack_not_used(p);
 #endif

^ permalink raw reply related	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-11-03  7:10 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-30 17:58 [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads Andy Lutomirski
2016-09-30 17:58 ` Andy Lutomirski
2016-09-30 17:58 ` [PATCH 1/3] proc: Stop reporting eip and esp in /proc/PID/stat Andy Lutomirski
2016-09-30 17:58   ` Andy Lutomirski
2016-09-30 18:56   ` Jann Horn
2016-09-30 18:56     ` Jann Horn
2016-10-01  2:01     ` Andy Lutomirski
2016-10-01  2:01       ` Andy Lutomirski
2016-10-01  4:22       ` Linus Torvalds
2016-10-01  4:22         ` Linus Torvalds
2016-10-01 10:37       ` Jann Horn
2016-10-01 10:37         ` Jann Horn
2016-10-14 18:25         ` Andy Lutomirski
2016-10-14 18:25           ` Andy Lutomirski
2016-10-14 20:01           ` Tycho Andersen
2016-10-20 11:13   ` [tip:mm/urgent] fs/proc: " tip-bot for Andy Lutomirski
2016-11-01 14:36   ` [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60 Tetsuo Handa
2016-11-01 23:47     ` Linus Torvalds
2016-11-02 10:50       ` Tetsuo Handa
2016-11-02 14:05         ` Andy Lutomirski
2016-11-02 14:05           ` Andy Lutomirski
2016-11-02 14:54         ` Linus Torvalds
2016-11-03  6:32           ` Ingo Molnar
2016-11-03  7:09         ` [tip:sched/urgent] sched/core: Fix oops in sched_show_task() tip-bot for Tetsuo Handa
2016-11-03  7:10       ` [tip:sched/urgent] sched/core: Remove pointless printout " tip-bot for Linus Torvalds
2016-09-30 17:58 ` [PATCH 2/3] proc: Stop trying to report thread stacks Andy Lutomirski
2016-10-20 11:13   ` [tip:mm/urgent] fs/proc: " tip-bot for Andy Lutomirski
2016-09-30 17:58 ` [PATCH 3/3] mm: Change vm_is_stack_for_task() to vm_is_stack_for_current() Andy Lutomirski
2016-09-30 17:58   ` Andy Lutomirski
2016-10-20 11:14   ` [tip:mm/urgent] " tip-bot for Andy Lutomirski
2016-10-03 23:08 ` [PATCH 0/3] ABI CHANGE!!! Remove questionable remote SP reads Andy Lutomirski
2016-10-03 23:08   ` Andy Lutomirski
2016-10-03 23:17   ` Linus Torvalds
2016-10-03 23:17     ` Linus Torvalds
2016-10-04  7:06     ` Raymond Jennings
2016-10-04  7:06       ` Raymond Jennings
2016-10-14 18:26     ` Andy Lutomirski
2016-10-14 18:26       ` Andy Lutomirski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.