* [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
@ 2015-01-14  0:20 Calvin Owens
  2015-01-14  0:23 ` Calvin Owens
                   ` (3 more replies)
  0 siblings, 4 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-14  0:20 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely
  Cc: Siddhesh Poyarekar, linux-kernel, kernel-team, calvinowens

Commit b76437579d1344b6 ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced logic to mark thread stacks with the
"[stack:%d]" marker in /proc/<pid>/maps.

This causes reading /proc/<pid>/maps to take O(N^2) time, where N is
the number of threads sharing an address space, since each line of
output requires iterating over the VMA list looking for ranges that
correspond to the stack pointer in any task's register set. When
dealing with highly-threaded Java applications, reading this file can
take hours and trigger softlockup dumps.

Eliminating the "[stack:%d]" marker is not a viable option since it's
been there for some time, and I don't see a way to do the stack check
more efficiently that wouldn't end up making the whole thing really
ugly.

The use case I'm specifically concerned with is the lsof command, so
this patch adds an additional file, "mapped_files", that simply
iterates over the VMAs associated with the task and outputs a
newline-delimited list of the pathnames of the files associated with
the VMAs, if any.

This gives lsof and suchlike a way to determine the pathnames of files
mapped into a process without incurring the O(N^2) behavior of the
maps file.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
I'm also sending a simple repro program as a reply to this E-Mail.

 fs/proc/base.c     |  1 +
 fs/proc/internal.h |  1 +
 fs/proc/task_mmu.c | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..15f8bd0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2564,6 +2564,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
+	REG("mapped_files", S_IRUGO, proc_mapped_files_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
 #endif
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6fcdba5..a09bbdd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -284,6 +284,7 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
 
 extern const struct file_operations proc_pid_maps_operations;
 extern const struct file_operations proc_tid_maps_operations;
+extern const struct file_operations proc_mapped_files_operations;
 extern const struct file_operations proc_pid_numa_maps_operations;
 extern const struct file_operations proc_tid_numa_maps_operations;
 extern const struct file_operations proc_pid_smaps_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 246eae8..bc101e0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -412,6 +412,38 @@ const struct file_operations proc_tid_maps_operations = {
 	.release	= proc_map_release,
 };
 
+static int show_next_mapped_file(struct seq_file *m, void *v)
+{
+	struct vm_area_struct *vma = v;
+	struct file *file = vma->vm_file;
+
+	if (file) {
+		seq_path(m, &file->f_path, "\n");
+		seq_putc(m, '\n');
+	}
+
+	return 0;
+}
+
+static const struct seq_operations mapped_files_seq_op = {
+	.start	= m_start,
+	.next	= m_next,
+	.stop	= m_stop,
+	.show	= show_next_mapped_file,
+};
+
+static int mapped_files_open(struct inode *inode, struct file *file)
+{
+	return do_maps_open(inode, file, &mapped_files_seq_op);
+}
+
+const struct file_operations proc_mapped_files_operations = {
+	.open		= mapped_files_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
 /*
  * Proportional Set Size(PSS): my share of RSS.
  *
-- 
2.1.4



* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14  0:20 [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Calvin Owens
@ 2015-01-14  0:23 ` Calvin Owens
  2015-01-14 14:13 ` Rasmus Villemoes
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-14  0:23 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely
  Cc: Siddhesh Poyarekar, linux-kernel, kernel-team, calvinowens

Here's a simple program to trigger the issue with /proc/<pid>/maps.

Thanks,
Calvin

/* Simple program to reproduce O(N^2) behavior reading /proc/<pid>/maps
 *
 * Example on a random server:
 *
 * 	$ ./map_repro 0
 * 	Spawning 0 threads
 * 	Reading /proc/self/maps... read 2189 bytes in 1 syscalls in 33us!
 * 	$ ./map_repro 10
 * 	Spawning 10 threads
 * 	Reading /proc/self/maps... read 3539 bytes in 1 syscalls in 55us!
 * 	$ ./map_repro 100
 * 	Spawning 100 threads
 * 	Reading /proc/self/maps... read 15689 bytes in 4 syscalls in 373us!
 * 	$ ./map_repro 1000
 * 	Spawning 1000 threads
 * 	Reading /proc/self/maps... read 137189 bytes in 34 syscalls in 32376us!
 * 	$ ./map_repro 2000
 * 	Spawning 2000 threads
 * 	Reading /proc/self/maps... read 272189 bytes in 68 syscalls in 119980us!
 * 	$ ./map_repro 4000
 * 	Spawning 4000 threads
 * 	Reading /proc/self/maps... read 544912 bytes in 134 syscalls in 712200us!
 * 	$ ./map_repro 8000
 * 	Spawning 8000 threads
 * 	Reading /proc/self/maps... read 1090189 bytes in 268 syscalls in 3650718us!
 * 	$ ./map_repro 16000
 * 	Spawning 16000 threads
 * 	Reading /proc/self/maps... read 2178189 bytes in 534 syscalls in 42701311us!
 */

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <limits.h>
#include <pthread.h>
#include <unistd.h>
#include <time.h>
#include <fcntl.h>

static char buf[65536] = {0};
static void time_maps_read(void)
{
	struct timespec then, now;
	long usec_elapsed;
	int ret, fd;
	int count = 0;
	int rd = 0;

	fd = open("/proc/self/maps", O_RDONLY);
	if (fd == -1) {
		printf("Couldn't open /proc/self/maps, bailing...\n");
		return;
	}

	printf("Reading /proc/self/maps... ");
	ret = clock_gettime(CLOCK_MONOTONIC, &then);

	/* Drain the file, counting bytes read and read() calls made. */
	while (1) {
		ret = read(fd, buf, sizeof(buf));
		if (ret <= 0)
			break;
		rd += ret;
		count++;
	}

	ret = clock_gettime(CLOCK_MONOTONIC, &now); 
	usec_elapsed = (now.tv_sec - then.tv_sec) * 1000000L;
	usec_elapsed += (now.tv_nsec - then.tv_nsec) / 1000L;

	printf("read %d bytes in %d syscalls in %ldus!\n", rd, count, usec_elapsed);
	close(fd);
}

static void *do_nothing_forever(void *unused)
{
	while (1)
		sleep(60);

	return NULL;
}

int main(int argc, char **argv)
{
	int i, ret, threads_to_spawn = 0;
	pthread_t tmp;

	/* Optional first argument: number of idle threads to spawn. */
	if (argc != 1) {
		threads_to_spawn = atoi(argv[1]);
		printf("Spawning %d threads\n", threads_to_spawn);
	}

	for (i = 0; i < threads_to_spawn; i++) {
		ret = pthread_create(&tmp, NULL, do_nothing_forever, NULL);
		if (ret)
			printf("Thread %d failed to spawn?\n", i);
	}

	time_maps_read();
	return 0;
}
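
The timings in the header comment can be checked against the quadratic
claim: if the read time grows as N^2, doubling the thread count should
roughly quadruple the elapsed time. The sketch below computes that ratio
from the numbers copied out of the example runs; growth_ratio() and
print_growth() are hypothetical helper names used only for illustration.

```c
#include <stdio.h>

/* Thread counts and elapsed microseconds copied from the example runs above. */
static const long sample_threads[] = { 1000, 2000, 4000, 8000, 16000 };
static const long sample_usec[] = { 32376, 119980, 712200, 3650718, 42701311 };

/* Hypothetical helper: how many times slower run i+1 was than run i. */
static double growth_ratio(const long *usec, int i)
{
	return (double)usec[i + 1] / (double)usec[i];
}

/* Print the slowdown observed at each doubling of the thread count. */
static void print_growth(void)
{
	int n = (int)(sizeof(sample_usec) / sizeof(sample_usec[0]));
	int i;

	for (i = 0; i + 1 < n; i++)
		printf("%ld -> %ld threads: %.1fx slower\n",
		       sample_threads[i], sample_threads[i + 1],
		       growth_ratio(sample_usec, i));
}
```

Each doubling costs roughly 3.7x, 5.9x, 5.1x and 11.7x respectively, so
the measured growth is at least quadratic on this machine.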


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14  0:20 [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Calvin Owens
  2015-01-14  0:23 ` Calvin Owens
@ 2015-01-14 14:13 ` Rasmus Villemoes
  2015-01-14 14:37   ` Siddhesh Poyarekar
  2015-01-14 15:25 ` Kirill A. Shutemov
  2015-01-14 22:40 ` [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Kirill A. Shutemov
  3 siblings, 1 reply; 80+ messages in thread
From: Rasmus Villemoes @ 2015-01-14 14:13 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team

On Wed, Jan 14 2015, Calvin Owens <calvinowens@fb.com> wrote:

> Commit b76437579d1344b6 ("procfs: mark thread stack correctly in
> proc/<pid>/maps") introduced logic to mark thread stacks with the
> "[stack:%d]" marker in /proc/<pid>/maps.
>
> This causes reading /proc/<pid>/maps to take O(N^2) time, where N is
> the number of threads sharing an address space, since each line of
> output requires iterating over the VMA list looking for ranges that
> correspond to the stack pointer in any task's register set. When
> dealing with highly-threaded Java applications, reading this file can
> take hours and trigger softlockup dumps.
>
> Eliminating the "[stack:%d]" marker is not a viable option since it's
> been there for some time, and I don't see a way to do the stack check
> more efficiently that wouldn't end up making the whole thing really
> ugly.

Just thinking out loud: Could one simply mark a VMA as being used for
stack during the clone call (is there room in vm_flags, or does
VM_GROWSDOWN already tell the whole story?), and then write the TID into
a new field in the VMA - I think one could make a union with vm_pgoff so
as not to enlarge the structure.

This would allow eliminating the loop over tasks in vm_is_stack.

Rasmus


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 14:13 ` Rasmus Villemoes
@ 2015-01-14 14:37   ` Siddhesh Poyarekar
  2015-01-14 14:53     ` Rasmus Villemoes
  0 siblings, 1 reply; 80+ messages in thread
From: Siddhesh Poyarekar @ 2015-01-14 14:37 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Calvin Owens, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, linux-kernel, kernel-team

On 14 January 2015 at 19:43, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
> Just thinking out loud: Could one simply mark a VMA as being used for
> stack during the clone call (is there room in vm_flags, or does
> VM_GROWSDOWN already tell the whole story?), and then write the TID into
> a new field in the VMA - I think one could make a union with vm_pgoff so
> as not to enlarge the structure.

vm_flags does not have space IIRC (that was my first approach at
implementing this) and VM_GROWSDOWN is not sufficient.  If we can make
a union with vm_pgoff like you say, we probably don't need a flag
value; a non-zero value could indicate that it is a thread stack.

One problem with caching the value on clone like this though is that
the stack could change due to a setcontext, but AFAICT we don't care
about that for the process stack either.

Siddhesh
-- 
http://siddhesh.in


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 14:37   ` Siddhesh Poyarekar
@ 2015-01-14 14:53     ` Rasmus Villemoes
  2015-01-14 21:03       ` Calvin Owens
  0 siblings, 1 reply; 80+ messages in thread
From: Rasmus Villemoes @ 2015-01-14 14:53 UTC (permalink / raw)
  To: Siddhesh Poyarekar
  Cc: Calvin Owens, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, linux-kernel, kernel-team

On Wed, Jan 14 2015, Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> wrote:

> On 14 January 2015 at 19:43, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
>> Just thinking out loud: Could one simply mark a VMA as being used for
>> stack during the clone call (is there room in vm_flags, or does
>> VM_GROWSDOWN already tell the whole story?), and then write the TID into
>> a new field in the VMA - I think one could make a union with vm_pgoff so
>> as not to enlarge the structure.
>
> vm_flags does not have space IIRC (that was my first approach at
> implementing this) and VM_GROWSDOWN is not sufficient.

Looking at include/linux/mm.h:

#define VM_GROWSDOWN    0x00000100      /* general info on the segment */
#define VM_PFNMAP       0x00000400      /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE    0x00000800      /* ETXTBSY on write attempts.. */

It would seem that 0x00000200 is available (unless defined and used
somewhere else).

> If we can make a union with vm_pgoff like you say, we probably don't
> need a flag value; a non-zero value could indicate that it is a thread
> stack.

Well, only when combined with checking vm_file for being NULL. One would
also need to ensure that vm_pgoff is 0 for any non-stack,
non-file-backed VMA. At which point it is somewhat ugly. 
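
To make the union idea concrete, here is a rough user-space sketch. It
assumes a simplified stand-in structure (struct vma_sketch) rather than
the real struct vm_area_struct, and the vm_stack_tid field and the
vma_stack_tid() helper are hypothetical names. It also assumes that
vm_pgoff really is zero in every anonymous VMA that is not a stack,
which would have to be verified before doing anything like this in the
kernel.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start;
	unsigned long vm_end;
	void *vm_file;			/* NULL for anonymous mappings */
	union {
		unsigned long vm_pgoff;		/* offset in pages (file-backed) */
		unsigned long vm_stack_tid;	/* hypothetical: TID using this stack */
	};
};

/*
 * Return the TID recorded for a stack VMA, or 0 if the VMA is not marked.
 * The shared word is only interpreted as a TID when there is no backing
 * file, which is the vm_file check described above.
 */
static unsigned long vma_stack_tid(const struct vma_sketch *vma)
{
	if (vma->vm_file)
		return 0;
	return vma->vm_stack_tid;
}
```

The union keeps the structure the same size, at the cost of making the
meaning of that word depend on vm_file.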

> One problem with caching the value on clone like this though is that
> the stack could change due to a setcontext, but AFAICT we don't care
> about that for the process stack either.

If it is important, I guess one could update the info when a task calls
setcontext.

Rasmus


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14  0:20 [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Calvin Owens
  2015-01-14  0:23 ` Calvin Owens
  2015-01-14 14:13 ` Rasmus Villemoes
@ 2015-01-14 15:25 ` Kirill A. Shutemov
  2015-01-14 15:33   ` Cyrill Gorcunov
  2015-01-14 22:40 ` [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Kirill A. Shutemov
  3 siblings, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-14 15:25 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team, Cyrill Gorcunov,
	Pavel Emelyanov

On Tue, Jan 13, 2015 at 04:20:29PM -0800, Calvin Owens wrote:
> Commit b76437579d1344b6 ("procfs: mark thread stack correctly in
> proc/<pid>/maps") introduced logic to mark thread stacks with the
> "[stack:%d]" marker in /proc/<pid>/maps.
> 
> This causes reading /proc/<pid>/maps to take O(N^2) time, where N is
> the number of threads sharing an address space, since each line of
> output requires iterating over the VMA list looking for ranges that
> correspond to the stack pointer in any task's register set. When
> dealing with highly-threaded Java applications, reading this file can
> take hours and trigger softlockup dumps.
> 
> Eliminating the "[stack:%d]" marker is not a viable option since it's
> been there for some time, and I don't see a way to do the stack check
> more efficiently that wouldn't end up making the whole thing really
> ugly.
> 
> The use case I'm specifically concerned with is the lsof command, so
> this patch adds an additional file, "mapped_files", that simply
> iterates over the VMAs associated with the task and outputs a
> newline-delimited list of the pathnames of the files associated with
> the VMAs, if any.
> 
> This gives lsof and suchlike a way to determine the pathnames of files
> mapped into a process without incurring the O(N^2) behavior of the
> maps file.

We already have /proc/PID/map_files/ directory which lists all mapped
files. Should we consider relaxing permission checking there and move it
outside CONFIG_CHECKPOINT_RESTORE instead?

Restricting follow_link to CAP_SYS_ADMIN is understandable, but why do we
restrict readdir and readlink?

-- 
 Kirill A. Shutemov


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 15:25 ` Kirill A. Shutemov
@ 2015-01-14 15:33   ` Cyrill Gorcunov
  2015-01-14 20:46     ` Calvin Owens
  0 siblings, 1 reply; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-14 15:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Calvin Owens, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov

On Wed, Jan 14, 2015 at 05:25:01PM +0200, Kirill A. Shutemov wrote:
...
> > 
> > This gives lsof and suchlike a way to determine the pathnames of files
> > mapped into a process without incurring the O(N^2) behavior of the
> > maps file.
> 
> We already have /proc/PID/map_files/ directory which lists all mapped
> files. Should we consider relaxing permission checking there and move it
> outside CONFIG_CHECKPOINT_RESTORE instead?
> 
> Restricting follow_link to CAP_SYS_ADMIN is understandable, but why do we
> restrict readdir and readlink?

We didn't think this functionality would be needed by anyone but us (the CRIU
camp), so the rule of thumb was CONFIG_CHECKPOINT_RESTORE + CAP_SYS_ADMIN until
there was a strict need for anything else. I think we can now relax the
security rules a bit and allow readdir and the like for owners.

	Cyrill


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 15:33   ` Cyrill Gorcunov
@ 2015-01-14 20:46     ` Calvin Owens
  2015-01-14 21:16       ` Cyrill Gorcunov
  0 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-01-14 20:46 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov

On Wednesday 01/14 at 18:33 +0300, Cyrill Gorcunov wrote:
> On Wed, Jan 14, 2015 at 05:25:01PM +0200, Kirill A. Shutemov wrote:
> ...
> > > 
> > > This gives lsof and suchlike a way to determine the pathnames of files
> > > mapped into a process without incurring the O(N^2) behavior of the
> > > maps file.
> > 
> > We already have /proc/PID/map_files/ directory which lists all mapped
> > files. Should we consider relaxing permission checking there and move it
> > outside CONFIG_CHECKPOINT_RESTORE instead?
> > 
> > Restricting follow_link to CAP_SYS_ADMIN is understandable, but why do we
> > restrict readdir and readlink?
> 
> We didn't think this functionality would be needed by anyone but us (the CRIU
> camp), so the rule of thumb was CONFIG_CHECKPOINT_RESTORE + CAP_SYS_ADMIN until
> there was a strict need for anything else. I think we can now relax the
> security rules a bit and allow readdir and the like for owners.

Ah, I feel silly for missing that. I'll send a patch to move map_files
out from behind CONFIG_CHECKPOINT_RESTORE and change the permissions.

Thanks,
Calvin


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 14:53     ` Rasmus Villemoes
@ 2015-01-14 21:03       ` Calvin Owens
  2015-01-14 22:45         ` Andrew Morton
  0 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-01-14 21:03 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Siddhesh Poyarekar, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, linux-kernel, kernel-team

On Wednesday 01/14 at 15:53 +0100, Rasmus Villemoes wrote:
> On Wed, Jan 14 2015, Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> wrote:
> 
> > On 14 January 2015 at 19:43, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
> >> Just thinking out loud: Could one simply mark a VMA as being used for
> >> stack during the clone call (is there room in vm_flags, or does
> >> VM_GROWSDOWN already tell the whole story?), and then write the TID into
> >> a new field in the VMA - I think one could make a union with vm_pgoff so
> >> as not to enlarge the structure.
> >
> > vm_flags does not have space IIRC (that was my first approach at
> > implementing this) and VM_GROWSDOWN is not sufficient.
> 
> Looking at include/linux/mm.h:
> 
> #define VM_GROWSDOWN    0x00000100      /* general info on the segment */
> #define VM_PFNMAP       0x00000400      /* Page-ranges managed without "struct page", just pure PFN */
> #define VM_DENYWRITE    0x00000800      /* ETXTBSY on write attempts.. */
> 
> It would seem that 0x00000200 is available (unless defined and used
> somewhere else).
> 
> > If we can make a union with vm_pgoff like you say, we probably don't
> > need a flag value; a non-zero value could indicate that it is a thread
> > stack.
> 
> Well, only when combined with checking vm_file for being NULL. One would
> also need to ensure that vm_pgoff is 0 for any non-stack,
> non-file-backed VMA. At which point it is somewhat ugly. 
> 
> > One problem with caching the value on clone like this though is that
> > the stack could change due to a setcontext, but AFAICT we don't care
> > about that for the process stack either.
> 
> If it is important, I guess one could update the info when a task calls
> setcontext.

If I understand the current behavior, the "[stack]" marker will get put
next to *any* mapping that encompasses the current value in the task's
%sp, regardless of how the mapping was created or ucontext stuff. If
you use flags on the VMA structs things could potentially be marked as
stacks even though %sp points somewhere else.

It's probable that nobody cares (you'd obviously have to be doing crazy
things to be pointing %sp at arbitrary places), but that's why I was
hesitant to mess with it.

Thanks,
Calvin
 
> Rasmus


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 20:46     ` Calvin Owens
@ 2015-01-14 21:16       ` Cyrill Gorcunov
  2015-01-22  2:45         ` [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable Calvin Owens
  0 siblings, 1 reply; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-14 21:16 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov

On Wed, Jan 14, 2015 at 12:46:53PM -0800, Calvin Owens wrote:
> > > 
> > > Restricting follow_link to CAP_SYS_ADMIN is understandable, but why do we
> > > restrict readdir and readlink?
> > 
> > We didn't think this functionality would be needed by anyone but us (the CRIU
> > camp), so the rule of thumb was CONFIG_CHECKPOINT_RESTORE + CAP_SYS_ADMIN until
> > there was a strict need for anything else. I think we can now relax the
> > security rules a bit and allow readdir and the like for owners.
> 
> Ah, I feel silly for missing that. I'll send a patch to move map_files
> out from behind CONFIG_CHECKPOINT_RESTORE and change the permissions.

Sure


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14  0:20 [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Calvin Owens
                   ` (2 preceding siblings ...)
  2015-01-14 15:25 ` Kirill A. Shutemov
@ 2015-01-14 22:40 ` Kirill A. Shutemov
  3 siblings, 0 replies; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-14 22:40 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team

On Tue, Jan 13, 2015 at 04:20:29PM -0800, Calvin Owens wrote:
> Commit b76437579d1344b6 ("procfs: mark thread stack correctly in
> proc/<pid>/maps") introduced logic to mark thread stacks with the
> "[stack:%d]" marker in /proc/<pid>/maps.
> 
> This causes reading /proc/<pid>/maps to take O(N^2) time, where N is
> the number of threads sharing an address space, since each line of
> output requires iterating over the VMA list looking for ranges that
> correspond to the stack pointer in any task's register set. When
> dealing with highly-threaded Java applications, reading this file can
> take hours and trigger softlockup dumps.
> 
> Eliminating the "[stack:%d]" marker is not a viable option since it's
> been there for some time, and I don't see a way to do the stack check
> more efficiently that wouldn't end up making the whole thing really
> ugly.

Can we find the stack for each thread once in seq_operations::start() and
avoid calling for_each_thread() in seq_operations::show() for each stack vma?

-- 
 Kirill A. Shutemov


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 21:03       ` Calvin Owens
@ 2015-01-14 22:45         ` Andrew Morton
  2015-01-14 23:51           ` Rasmus Villemoes
  0 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-01-14 22:45 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Rasmus Villemoes, Siddhesh Poyarekar, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, linux-kernel, kernel-team

On Wed, 14 Jan 2015 13:03:26 -0800 Calvin Owens <calvinowens@fb.com> wrote:

> > Well, only when combined with checking vm_file for being NULL. One would
> > also need to ensure that vm_pgoff is 0 for any non-stack,
> > non-file-backed VMA. At which point it is somewhat ugly. 
> > 
> > > One problem with caching the value on clone like this though is that
> > > the stack could change due to a setcontext, but AFAICT we don't care
> > > about that for the process stack either.
> > 
> > If it is important, I guess one could update the info when a task calls
> > setcontext.
> 
> If I understand the current behavior, the "[stack]" marker will get put
> next to *any* mapping that encompasses the current value in the task's
> %sp, regardless of how the mapping was created or ucontext stuff. If
> you use flags on the VMA structs things could potentially be marked as
> stacks even though %sp points somewhere else.
> 
> It's probable that nobody cares (you'd obviously have to be doing crazy
> things to be pointing %sp at arbitrary places), but that's why I was
> hesitant to mess with it.

Fixing the N^2 search would of course be much better than adding a new
proc file to sidestep it.

Could we do something like refreshing the new vma.vm_flags:VM_IS_STACK
on each thread at the time when /proc/PID/maps is opened?  So do a walk
of the threads, use each thread's sp to hunt down the thread's stack's
vma, then set VM_IS_STACK and fill in the new vma.stack_tid field?

There are still several flags unused in vma.vm_flags btw.

I'm not sure that we can repurpose vm_pgoff (or vm_private_data) for
this: a badly behaved thread could make its sp point at a random vma
then trick the kernel into scribbling on that vma's vm_pgoff?  Adding a
new field to the vma wouldn't kill us, I guess.  That would remove the
need for a VM_IS_STACK.




* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 22:45         ` Andrew Morton
@ 2015-01-14 23:51           ` Rasmus Villemoes
  2015-01-16  1:15             ` Andrew Morton
  0 siblings, 1 reply; 80+ messages in thread
From: Rasmus Villemoes @ 2015-01-14 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Calvin Owens, Siddhesh Poyarekar, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, linux-kernel, kernel-team

On Wed, Jan 14 2015, Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 14 Jan 2015 13:03:26 -0800 Calvin Owens <calvinowens@fb.com> wrote:
>> 
>> If I understand the current behavior, the "[stack]" marker will get put
>> next to *any* mapping that encompasses the current value in the task's
>> %sp, regardless of how the mapping was created or ucontext stuff. If
>> you use flags on the VMA structs things could potentially be marked as
>> stacks even though %sp points somewhere else.
>> 
>> It's probable that nobody cares (you'd obviously have to be doing crazy
>> things to be pointing %sp at arbitrary places), but that's why I was
>> hesitant to mess with it.
>
> Fixing the N^2 search would of course be much better than adding a new
> proc file to sidestep it.
>
> Could we do something like refreshing the new vma.vm_flags:VM_IS_STACK
> on each thread at the time when /proc/PID/maps is opened?  So do a walk
> of the threads, use each thread's sp to hunt down the thread's stack's
> vma, then set VM_IS_STACK and fill in the new vma.stack_tid field?

So this would be roughly #tasks*log(#vmas) + #vmas. Sounds
good. Especially since all the work will be done by the reader, so
there's no extra bookkeeping to do in sys_clone etc. Concurrent readers
could influence what each other end up seeing, but most of the time the
update will be idempotent, and the information may be stale anyway by
the time the reader has a chance to process it.
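
A rough user-space sketch of that refresh pass, to show where the
#tasks*log(#vmas) + #vmas figure comes from. It assumes the VMAs are
available as an array sorted by start address (the kernel would use its
own lookup, e.g. find_vma()); struct vma_rec, find_vma_sketch() and
mark_thread_stacks() are hypothetical names, and stack_tid stands in for
the proposed new field.

```c
#include <stddef.h>

/* Simplified, hypothetical VMA record; array sorted by vm_start, no overlap. */
struct vma_rec {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
	int stack_tid;		/* 0 means "not a thread stack" */
};

/* Binary search for the VMA containing addr; NULL if none. O(log nvmas). */
static struct vma_rec *find_vma_sketch(struct vma_rec *vmas, size_t nvmas,
				       unsigned long addr)
{
	size_t lo = 0, hi = nvmas;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < vmas[mid].vm_start)
			hi = mid;
		else if (addr >= vmas[mid].vm_end)
			lo = mid + 1;
		else
			return &vmas[mid];
	}
	return NULL;
}

/*
 * Refresh the stack marks once, at open time: clear stale marks in
 * O(nvmas), then look up each thread's sp in O(ntasks * log nvmas).
 * Printing each VMA afterwards only has to test stack_tid.
 */
static void mark_thread_stacks(struct vma_rec *vmas, size_t nvmas,
			       const int *tids, const unsigned long *sps,
			       size_t ntasks)
{
	size_t i;

	for (i = 0; i < nvmas; i++)
		vmas[i].stack_tid = 0;

	for (i = 0; i < ntasks; i++) {
		struct vma_rec *vma = find_vma_sketch(vmas, nvmas, sps[i]);

		if (vma)
			vma->stack_tid = tids[i];
	}
}
```

Running the pass twice with the same inputs gives the same marks, which
is the idempotence mentioned above.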

> There are still several flags unused in vma.vm_flags btw.
>
> I'm not sure that we can repurpose vm_pgoff (or vm_private_data) for
> this: a badly behaved thread could make its sp point at a random vma
> then trick the kernel into scribbling on that vma's vm_pgoff?

Well, we could still check vm_file for being NULL before writing to
vm_pgoff/vm_stack_tid. 

> Adding a new field to the vma wouldn't kill us, I guess.  That would
> remove the need for a VM_IS_STACK.

Either way, it seems that that decision can be changed later.

Rasmus


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-14 23:51           ` Rasmus Villemoes
@ 2015-01-16  1:15             ` Andrew Morton
  2015-01-16 11:00               ` Kirill A. Shutemov
  0 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-01-16  1:15 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Calvin Owens, Siddhesh Poyarekar, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, linux-kernel, kernel-team

On Thu, 15 Jan 2015 00:51:50 +0100 Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> > There are still several flags unused in vma.vm_flags btw.
> >
> > I'm not sure that we can repurpose vm_pgoff (or vm_private_data) for
> > this: a badly behaved thread could make its sp point at a random vma
> > then trick the kernel into scribbling on that vma's vm_pgoff?
> 
> Well, we could still check vm_file for being NULL before writing to
> vm_pgoff/vm_stack_tid. 

Yes, I guess that would work.  We'd need to check that nobody else
is already playing similar games with vm_pgoff.


* Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files
  2015-01-16  1:15             ` Andrew Morton
@ 2015-01-16 11:00               ` Kirill A. Shutemov
  0 siblings, 0 replies; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-16 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rasmus Villemoes, Calvin Owens, Siddhesh Poyarekar,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely, linux-kernel,
	kernel-team

On Thu, Jan 15, 2015 at 05:15:43PM -0800, Andrew Morton wrote:
> On Thu, 15 Jan 2015 00:51:50 +0100 Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
> 
> > > There are still several flags unused in vma.vm_flags btw.
> > >
> > > I'm not sure that we can repurpose vm_pgoff (or vm_private_data) for
> > > this: a badly behaved thread could make its sp point at a random vma
> > > then trick the kernel into scribbling on that vma's vm_pgoff?
> > 
> > Well, we could still check vm_file for being NULL before writing to
> > vm_pgoff/vm_stack_tid. 
> 
> Yes, I guess that would work.  We'd need to check that nobody else
> is already playing similar games with vm_pgoff.

Well, we do use ->vm_pgoff in anonymous VMAs. For rmap in particular --
vma_address().

-- 
 Kirill A. Shutemov


* [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-14 21:16       ` Cyrill Gorcunov
@ 2015-01-22  2:45         ` Calvin Owens
  2015-01-22  7:16           ` Cyrill Gorcunov
                             ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-22  2:45 UTC (permalink / raw)
  To: Cyrill Gorcunov, Kirill A. Shutemov
  Cc: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team, Pavel Emelyanov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
is very useful for enumerating the files mapped into a process when
the more verbose information in /proc/<pid>/maps is not needed.

This patch moves the folder out from behind CHECKPOINT_RESTORE, and
removes the CAP_SYS_ADMIN restrictions. To avoid exposing files to
processes for whom they may not be visible, a follow_link() stub is
added to the inode_operations struct attached to the symlinks; the stub
prevents them from being followed without CAP_SYS_ADMIN.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
 fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..7d48003 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1632,8 +1632,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = dentry->d_inode;
 	task = get_proc_task(inode);
 	if (!task)
@@ -1753,6 +1746,28 @@ struct map_files_info {
 	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Allowing any user to follow the symlinks in /proc/<pid>/map_files/ could
+ * allow processes to access files that should not be visible to them, so
+ * restrict follow_link() to CAP_SYS_ADMIN for these files.
+ */
+static void *proc_map_files_follow_link(struct dentry *d, struct nameidata *n)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	return proc_pid_follow_link(d, n);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations proc_map_files_link_inode_operations = {
+	.readlink	= proc_pid_readlink,
+	.follow_link	= proc_map_files_follow_link,
+	.setattr	= proc_setattr,
+};
+
 static int
 proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 			   struct task_struct *task, const void *ptr)
@@ -1768,7 +1783,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 	ei = PROC_I(inode);
 	ei->op.proc_get_link = proc_map_files_get_link;
 
-	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
 	inode->i_mode = S_IFLNK;
 
@@ -1792,10 +1807,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
@@ -1849,10 +1860,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
@@ -2040,7 +2047,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2537,9 +2543,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-22  2:45         ` [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable Calvin Owens
@ 2015-01-22  7:16           ` Cyrill Gorcunov
  2015-01-22 11:02           ` Kirill A. Shutemov
  2015-01-24  3:15           ` [RFC][PATCH v2] " Calvin Owens
  2 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-22  7:16 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov

On Wed, Jan 21, 2015 at 06:45:54PM -0800, Calvin Owens wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> is very useful for enumerating the files mapped into a process when
> the more verbose information in /proc/<pid>/maps is not needed.
> 
> This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> removes the CAP_SYS_ADMIN restrictions. To avoid exposing files to
> processes for whom they may not be visible, a follow_link() stub is
> added to the inode_operations struct attached to the symlinks; the stub
> prevents them from being followed without CAP_SYS_ADMIN.
> 
> Signed-off-by: Calvin Owens <calvinowens@fb.com>

Looks good to me, thanks.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-22  2:45         ` [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable Calvin Owens
  2015-01-22  7:16           ` Cyrill Gorcunov
@ 2015-01-22 11:02           ` Kirill A. Shutemov
  2015-01-22 21:00             ` Calvin Owens
  2015-01-24  3:15           ` [RFC][PATCH v2] " Calvin Owens
  2 siblings, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-22 11:02 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Cyrill Gorcunov, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov

On Wed, Jan 21, 2015 at 06:45:54PM -0800, Calvin Owens wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> is very useful for enumerating the files mapped into a process when
> the more verbose information in /proc/<pid>/maps is not needed.
> 
> This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> removes the CAP_SYS_ADMIN restrictions. To avoid exposing files to
> processes for whom they may not be visible, a follow_link() stub is
> added to the inode_operations struct attached to the symlinks; the stub
> prevents them from being followed without CAP_SYS_ADMIN.
> 
> Signed-off-by: Calvin Owens <calvinowens@fb.com>
> ---
>  fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
>  1 file changed, 23 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 3f3d7ae..7d48003 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1632,8 +1632,6 @@ end_instantiate:
>  	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
>  }
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> -
>  /*
>   * dname_to_vma_addr - maps a dentry name into two unsigned longs
>   * which represent vma start and end addresses.
> @@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
>  	if (flags & LOOKUP_RCU)
>  		return -ECHILD;
>  
> -	if (!capable(CAP_SYS_ADMIN)) {
> -		status = -EPERM;
> -		goto out_notask;
> -	}
> -
>  	inode = dentry->d_inode;
>  	task = get_proc_task(inode);
>  	if (!task)
> @@ -1753,6 +1746,28 @@ struct map_files_info {
>  	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
>  };
>  
> +/*
> + * Allowing any user to follow the symlinks in /proc/<pid>/map_files/ could
> + * allow processes to access files that should not be visible to them, so
> + * restrict follow_link() to CAP_SYS_ADMIN for these files.
> + */
> +static void *proc_map_files_follow_link(struct dentry *d, struct nameidata *n)
> +{
> +	if (!capable(CAP_SYS_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	return proc_pid_follow_link(d, n);
> +}

I have thought a bit more about this and I'm not sure it's reasonable to
limit it to CAP_SYS_ADMIN. What scenario are we protecting against?

Initially, I thought about something like this: a privileged process opens
a file, maps part of it, closes the file and drops privileges, hoping to
limit further access to the mapped window of the file. But I don't see what
would stop the unprivileged process from accessing the rest of the file
using mremap(2). And if a process can do this, anybody who can ptrace(2)
the process can do this.

Am I missing something?
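
A minimal sketch of the scenario, assuming Linux with glibc: map only the
first page of a two-page file, close the descriptor, then grow the mapping
with mremap(2) and read the second page. The function name
sketch_read_beyond_window() and the temporary file path are made up for
illustration.

```c
#define _GNU_SOURCE		/* for mremap() and MREMAP_MAYMOVE */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the first byte of the second page of the file ('B'),
 * read through a mapping that originally covered only the first
 * page, or -1 on any failure.
 */
static int sketch_read_beyond_window(void)
{
	char path[] = "/tmp/mremap-sketch-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	char *buf, *win, *grown;
	int fd, result;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the open descriptor keeps the file alive */

	/* Fill page one with 'A' and page two with 'B'. */
	buf = malloc(2 * page);
	if (!buf) {
		close(fd);
		return -1;
	}
	memset(buf, 'A', page);
	memset(buf + page, 'B', page);
	if (write(fd, buf, 2 * page) != 2 * page) {
		free(buf);
		close(fd);
		return -1;
	}
	free(buf);

	/* Map only the first page, then give up the descriptor. */
	win = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (win == MAP_FAILED)
		return -1;

	/* No descriptor is needed to extend the window. */
	grown = mremap(win, page, 2 * page, MREMAP_MAYMOVE);
	if (grown == MAP_FAILED) {
		munmap(win, page);
		return -1;
	}
	result = grown[page];
	munmap(grown, 2 * page);
	return result;
}
```

Note that growing the mapping this way only reaches file contents beyond
the original window's offset; the sketch does not try to reach earlier
offsets.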

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-22 11:02           ` Kirill A. Shutemov
@ 2015-01-22 21:00             ` Calvin Owens
  2015-01-22 21:27               ` Kirill A. Shutemov
  0 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-01-22 21:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Cyrill Gorcunov, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov

On Thursday 01/22 at 13:02 +0200, Kirill A. Shutemov wrote:
> On Wed, Jan 21, 2015 at 06:45:54PM -0800, Calvin Owens wrote:
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > is very useful for enumerating the files mapped into a process when
> > the more verbose information in /proc/<pid>/maps is not needed.
> > 
> > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > removes the CAP_SYS_ADMIN restrictions. To avoid exposing files to
> > processes for whom they may not be visible, a follow_link() stub is
> > added to the inode_operations struct attached to the symlinks; the stub
> > prevents them from being followed without CAP_SYS_ADMIN.
> > 
> > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> > ---
> >  fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
> >  1 file changed, 23 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 3f3d7ae..7d48003 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1632,8 +1632,6 @@ end_instantiate:
> >  	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
> >  }
> >  
> > -#ifdef CONFIG_CHECKPOINT_RESTORE
> > -
> >  /*
> >   * dname_to_vma_addr - maps a dentry name into two unsigned longs
> >   * which represent vma start and end addresses.
> > @@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
> >  	if (flags & LOOKUP_RCU)
> >  		return -ECHILD;
> >  
> > -	if (!capable(CAP_SYS_ADMIN)) {
> > -		status = -EPERM;
> > -		goto out_notask;
> > -	}
> > -
> >  	inode = dentry->d_inode;
> >  	task = get_proc_task(inode);
> >  	if (!task)
> > @@ -1753,6 +1746,28 @@ struct map_files_info {
> >  	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
> >  };
> >  
> > +/*
> > + * Allowing any user to follow the symlinks in /proc/<pid>/map_files/ could
> > + * allow processes to access files that should not be visible to them, so
> > + * restrict follow_link() to CAP_SYS_ADMIN for these files.
> > + */
> > +static void *proc_map_files_follow_link(struct dentry *d, struct nameidata *n)
> > +{
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return ERR_PTR(-EPERM);
> > +
> > +	return proc_pid_follow_link(d, n);
> > +}
> 
> I have thought a bit more about this and I'm not sure it's reasonable to
> limit it to CAP_SYS_ADMIN. What scenario are we protecting against?
> 
> Initially, I thought about something like this: a privileged process opens
> a file, maps part of it, closes the file and drops privileges, hoping to
> limit further access to the mapped window of the file. But I don't see what
> would stop the unprivileged process from accessing the rest of the file
> using mremap(2). And if a process can do this, anybody who can ptrace(2)
> the process can do this.
> 
> Am I missing something?

The specific case I was thinking of is a process in a chroot with a
mounted /proc inside of it: if a process inside the chroot has the same
UID as a process outside of it, the chroot'ed process could follow the
symlinks in map_files/ and poke files it can't actually see, right?

I don't personally care about that use case, but it seemed like
something that might surprise somebody.
 
Calvin
 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-22 21:00             ` Calvin Owens
@ 2015-01-22 21:27               ` Kirill A. Shutemov
  2015-01-23  5:52                 ` Calvin Owens
  0 siblings, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-22 21:27 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Cyrill Gorcunov, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov

On Thu, Jan 22, 2015 at 01:00:25PM -0800, Calvin Owens wrote:
> On Thursday 01/22 at 13:02 +0200, Kirill A. Shutemov wrote:
> > On Wed, Jan 21, 2015 at 06:45:54PM -0800, Calvin Owens wrote:
> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > > is very useful for enumerating the files mapped into a process when
> > > the more verbose information in /proc/<pid>/maps is not needed.
> > > 
> > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > > removes the CAP_SYS_ADMIN restrictions. To avoid exposing files to
> > > processes for whom they may not be visible, a follow_link() stub is
> > > added to the inode_operations struct attached to the symlinks; the stub
> > > prevents them from being followed without CAP_SYS_ADMIN.
> > > 
> > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> > > ---
> > >  fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
> > >  1 file changed, 23 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index 3f3d7ae..7d48003 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -1632,8 +1632,6 @@ end_instantiate:
> > >  	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
> > >  }
> > >  
> > > -#ifdef CONFIG_CHECKPOINT_RESTORE
> > > -
> > >  /*
> > >   * dname_to_vma_addr - maps a dentry name into two unsigned longs
> > >   * which represent vma start and end addresses.
> > > @@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
> > >  	if (flags & LOOKUP_RCU)
> > >  		return -ECHILD;
> > >  
> > > -	if (!capable(CAP_SYS_ADMIN)) {
> > > -		status = -EPERM;
> > > -		goto out_notask;
> > > -	}
> > > -
> > >  	inode = dentry->d_inode;
> > >  	task = get_proc_task(inode);
> > >  	if (!task)
> > > @@ -1753,6 +1746,28 @@ struct map_files_info {
> > >  	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
> > >  };
> > >  
> > > +/*
> > > + * Allowing any user to follow the symlinks in /proc/<pid>/map_files/ could
> > > + * allow processes to access files that should not be visible to them, so
> > > + * restrict follow_link() to CAP_SYS_ADMIN for these files.
> > > + */
> > > +static void *proc_map_files_follow_link(struct dentry *d, struct nameidata *n)
> > > +{
> > > +	if (!capable(CAP_SYS_ADMIN))
> > > +		return ERR_PTR(-EPERM);
> > > +
> > > +	return proc_pid_follow_link(d, n);
> > > +}
> > 
> > I have thought a bit more about this and I'm not sure it's reasonable to
> > limit it to CAP_SYS_ADMIN. What scenario are we protecting against?
> > 
> > Initially, I thought about something like this: a privileged process opens
> > a file, maps part of it, closes the file and drops privileges, hoping to
> > limit further access to the mapped window of the file. But I don't see what
> > would stop the unprivileged process from accessing the rest of the file
> > using mremap(2). And if a process can do this, anybody who can ptrace(2)
> > the process can do this.
> > 
> > Am I missing something?
> 
> The specific case I was thinking of is a process in a chroot with a
> mounted /proc inside of it: if a process inside the chroot has the same
> UID as a process outside of it, the chroot'ed process could follow the
> symlinks in map_files/ and poke files it can't actually see, right?

It depends on how you define "poke". If you mean touching the content of
the file, then, well, you can do it now. You cannot do anything which
requires a file descriptor -- open(), ftruncate(), etc.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-22 21:27               ` Kirill A. Shutemov
@ 2015-01-23  5:52                 ` Calvin Owens
  0 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-23  5:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Cyrill Gorcunov, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov

On Thursday 01/22 at 23:27 +0200, Kirill A. Shutemov wrote:
> On Thu, Jan 22, 2015 at 01:00:25PM -0800, Calvin Owens wrote:
> > On Thursday 01/22 at 13:02 +0200, Kirill A. Shutemov wrote:
> > > On Wed, Jan 21, 2015 at 06:45:54PM -0800, Calvin Owens wrote:
> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > > > is very useful for enumerating the files mapped into a process when
> > > > the more verbose information in /proc/<pid>/maps is not needed.
> > > > 
> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > > > removes the CAP_SYS_ADMIN restrictions. To avoid exposing files to
> > > > processes for whom they may not be visible, a follow_link() stub is
> > > > added to the inode_operations struct attached to the symlinks; the stub
> > > > prevents them from being followed without CAP_SYS_ADMIN.
> > > > 
> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> > > > ---
> > > >  fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
> > > >  1 file changed, 23 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > > index 3f3d7ae..7d48003 100644
> > > > --- a/fs/proc/base.c
> > > > +++ b/fs/proc/base.c
> > > > @@ -1632,8 +1632,6 @@ end_instantiate:
> > > >  	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
> > > >  }
> > > >  
> > > > -#ifdef CONFIG_CHECKPOINT_RESTORE
> > > > -
> > > >  /*
> > > >   * dname_to_vma_addr - maps a dentry name into two unsigned longs
> > > >   * which represent vma start and end addresses.
> > > > @@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
> > > >  	if (flags & LOOKUP_RCU)
> > > >  		return -ECHILD;
> > > >  
> > > > -	if (!capable(CAP_SYS_ADMIN)) {
> > > > -		status = -EPERM;
> > > > -		goto out_notask;
> > > > -	}
> > > > -
> > > >  	inode = dentry->d_inode;
> > > >  	task = get_proc_task(inode);
> > > >  	if (!task)
> > > > @@ -1753,6 +1746,28 @@ struct map_files_info {
> > > >  	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
> > > >  };
> > > >  
> > > > +/*
> > > > + * Allowing any user to follow the symlinks in /proc/<pid>/map_files/ could
> > > > + * allow processes to access files that should not be visible to them, so
> > > > + * restrict follow_link() to CAP_SYS_ADMIN for these files.
> > > > + */
> > > > +static void *proc_map_files_follow_link(struct dentry *d, struct nameidata *n)
> > > > +{
> > > > +	if (!capable(CAP_SYS_ADMIN))
> > > > +		return ERR_PTR(-EPERM);
> > > > +
> > > > +	return proc_pid_follow_link(d, n);
> > > > +}
> > > 
> > > I have thought a bit more about this and I'm not sure it's reasonable to
> > > limit it to CAP_SYS_ADMIN. What scenario are we protecting against?
> > > 
> > > Initially, I thought about something like this: a privileged process opens
> > > a file, maps part of it, closes the file and drops privileges, hoping to
> > > limit further access to the mapped window of the file. But I don't see what
> > > would stop the unprivileged process from accessing the rest of the file
> > > using mremap(2). And if a process can do this, anybody who can ptrace(2)
> > > the process can do this.
> > > 
> > > Am I missing something?
> > 
> > The specific case I was thinking of is a process in a chroot with a
> > mounted /proc inside of it: if a process inside the chroot has the same
> > UID as a process outside of it, the chroot'ed process could follow the
> > symlinks in map_files/ and poke files it can't actually see, right?
> 
> It depends on how you define "poke". If you mean touching the content of
> the file, then, well, you can do it now. You cannot do anything which
> requires a file descriptor -- open(), ftruncate(), etc.

Ah okay, I didn't realize you couldn't get the file descriptor. I wrote
a quick test case, you get -EACCES on open() in my chroot scenario.

I'll resend without the CAP_SYS_ADMIN check.

Thanks,
Calvin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-22  2:45         ` [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable Calvin Owens
  2015-01-22  7:16           ` Cyrill Gorcunov
  2015-01-22 11:02           ` Kirill A. Shutemov
@ 2015-01-24  3:15           ` Calvin Owens
  2015-01-26 12:47             ` Kirill A. Shutemov
  2015-02-12  2:29             ` [RFC][PATCH v3] " Calvin Owens
  2 siblings, 2 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-24  3:15 UTC (permalink / raw)
  To: Cyrill Gorcunov, Kirill A. Shutemov
  Cc: Andrew Morton, Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman,
	Al Viro, Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team, Pavel Emelyanov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
is very useful for enumerating the files mapped into a process when
the more verbose information in /proc/<pid>/maps is not needed.

This patch moves the folder out from behind CHECKPOINT_RESTORE, and
removes the CAP_SYS_ADMIN restrictions. Following the links requires
the ability to ptrace the process in question, so this doesn't allow
an attacker to do anything they couldn't already do before.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
Changes in v2: 	Removed the follow_link() stub that returned -EPERM if
		the caller didn't have CAP_SYS_ADMIN, since the caller
		in my chroot() scenario gets -EACCES anyway.

 fs/proc/base.c | 18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..67b15ac 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1632,8 +1632,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = dentry->d_inode;
 	task = get_proc_task(inode);
 	if (!task)
@@ -1792,10 +1785,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
@@ -1849,10 +1838,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
@@ -2040,7 +2025,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2537,9 +2521,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-24  3:15           ` [RFC][PATCH v2] " Calvin Owens
@ 2015-01-26 12:47             ` Kirill A. Shutemov
  2015-01-26 21:00                 ` Cyrill Gorcunov
  2015-02-12  2:29             ` [RFC][PATCH v3] " Calvin Owens
  1 sibling, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-26 12:47 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Cyrill Gorcunov, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api

On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> is very useful for enumerating the files mapped into a process when
> the more verbose information in /proc/<pid>/maps is not needed.
> 
> This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> removes the CAP_SYS_ADMIN restrictions. Following the links requires
> the ability to ptrace the process in question, so this doesn't allow
> an attacker to do anything they couldn't already do before.
> 
> Signed-off-by: Calvin Owens <calvinowens@fb.com>

Cc +linux-api@

> ---
> Changes in v2: 	Removed the follow_link() stub that returned -EPERM if
> 		the caller didn't have CAP_SYS_ADMIN, since the caller
> 		in my chroot() scenario gets -EACCES anyway.
> 
>  fs/proc/base.c | 18 ------------------
>  1 file changed, 18 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 3f3d7ae..67b15ac 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1632,8 +1632,6 @@ end_instantiate:
>  	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
>  }
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> -
>  /*
>   * dname_to_vma_addr - maps a dentry name into two unsigned longs
>   * which represent vma start and end addresses.
> @@ -1660,11 +1658,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
>  	if (flags & LOOKUP_RCU)
>  		return -ECHILD;
>  
> -	if (!capable(CAP_SYS_ADMIN)) {
> -		status = -EPERM;
> -		goto out_notask;
> -	}
> -
>  	inode = dentry->d_inode;
>  	task = get_proc_task(inode);
>  	if (!task)
> @@ -1792,10 +1785,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
>  	int result;
>  	struct mm_struct *mm;
>  
> -	result = -EPERM;
> -	if (!capable(CAP_SYS_ADMIN))
> -		goto out;
> -
>  	result = -ENOENT;
>  	task = get_proc_task(dir);
>  	if (!task)
> @@ -1849,10 +1838,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
>  	struct map_files_info *p;
>  	int ret;
>  
> -	ret = -EPERM;
> -	if (!capable(CAP_SYS_ADMIN))
> -		goto out;
> -
>  	ret = -ENOENT;
>  	task = get_proc_task(file_inode(file));
>  	if (!task)
> @@ -2040,7 +2025,6 @@ static const struct file_operations proc_timers_operations = {
>  	.llseek		= seq_lseek,
>  	.release	= seq_release_private,
>  };
> -#endif /* CONFIG_CHECKPOINT_RESTORE */
>  
>  static int proc_pident_instantiate(struct inode *dir,
>  	struct dentry *dentry, struct task_struct *task, const void *ptr)
> @@ -2537,9 +2521,7 @@ static const struct inode_operations proc_task_inode_operations;
>  static const struct pid_entry tgid_base_stuff[] = {
>  	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
>  	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
> -#ifdef CONFIG_CHECKPOINT_RESTORE
>  	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
> -#endif
>  	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
>  	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
>  #ifdef CONFIG_NET
> -- 
> 1.8.1
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-26 21:00                 ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-26 21:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Calvin Owens, Andrew Morton, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api

On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > is very useful for enumerating the files mapped into a process when
> > the more verbose information in /proc/<pid>/maps is not needed.
> > 
> > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> > the ability to ptrace the process in question, so this doesn't allow
> > an attacker to do anything they couldn't already do before.
> > 
> > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> 
> Cc +linux-api@

Looks good to me, thanks! Though I would really appreciate it if someone
from the security camp took a look as well.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-26 21:00                 ` Cyrill Gorcunov
  (?)
@ 2015-01-26 23:43                 ` Andrew Morton
  2015-01-27  0:15                     ` Kees Cook
                                     ` (3 more replies)
  -1 siblings, 4 replies; 80+ messages in thread
From: Andrew Morton @ 2015-01-26 23:43 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api, Kees Cook

On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > > is very useful for enumerating the files mapped into a process when
> > > the more verbose information in /proc/<pid>/maps is not needed.

This is the main (actually only) justification for the patch, and it is
far too thin.  What does "not needed" mean?  Why can't people just use
/proc/pid/maps?

> > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> > > the ability to ptrace the process in question, so this doesn't allow
> > > an attacker to do anything they couldn't already do before.
> > > 
> > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> > 
> > Cc +linux-api@
> 
> Looks good to me, thanks! Though I would really appreciate if someone
> from security camp take a look as well.

hm, who's that.  Kees comes to mind.

And reviewers' task would be a heck of a lot easier if they knew what
/proc/pid/map_files actually does.  This:

akpm3:/usr/src/25> grep -r map_files Documentation 
akpm3:/usr/src/25> 

does not help.

The 640708a2cff7f81 changelog says:

:     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
:     symlinks one for each mapping with file, the name of a symlink is
:     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
:     results in a file that point exactly to the same inode as them vma's one.
:     
:     For example the ls -l of some arbitrary /proc/<pid>/map_files/
:     
:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

afaict this info is also available in /proc/pid/maps, so things
shouldn't get worse if the /proc/pid/map_files permissions are at least
as restrictive as the /proc/pid/maps permissions.  Is that the case? 
(Please add to changelog).


There's one other problem here: we're assuming that the map_files
implementation doesn't have bugs.  If it does have bugs then relaxing
permissions like this will create new vulnerabilities.  And the
map_files implementation is surprisingly complex.  Is it bug-free?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-27  0:15                     ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2015-01-27  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Kirill A. Shutemov, Calvin Owens,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On Mon, Jan 26, 2015 at 3:43 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
>> On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
>> > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
>> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
>> > > is very useful for enumerating the files mapped into a process when
>> > > the more verbose information in /proc/<pid>/maps is not needed.
>
> This is the main (actually only) justification for the patch, and it is
> far too thin.  What does "not needed" mean?  Why can't people just use
> /proc/pid/maps?
>
>> > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>> > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
>> > > the ability to ptrace the process in question, so this doesn't allow
>> > > an attacker to do anything they couldn't already do before.
>> > >
>> > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
>> >
>> > Cc +linux-api@
>>
>> Looks good to me, thanks! Though I would really appreciate if someone
>> from security camp take a look as well.
>
> hm, who's that.  Kees comes to mind.
>
> And reviewers' task would be a heck of a lot easier if they knew what
> /proc/pid/map_files actually does.  This:
>
> akpm3:/usr/src/25> grep -r map_files Documentation

If akpm's comments weren't clear: this needs to be fixed. Everything
in /proc should appear in Documentation.

> akpm3:/usr/src/25>
>
> does not help.
>
> The 640708a2cff7f81 changelog says:
>
> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> :     symlinks one for each mapping with file, the name of a symlink is
> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> :     results in a file that point exactly to the same inode as them vma's one.
> :
> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> :
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

How is mmap offset represented in this output?

>
> afaict this info is also available in /proc/pid/maps, so things
> shouldn't get worse if the /proc/pid/map_files permissions are at least
> as restrictive as the /proc/pid/maps permissions.  Is that the case?
> (Please add to changelog).

Both maps and map_files use ptrace_may_access() (via mm_access()) with
PTRACE_MODE_READ, so I'm happy from an info leak perspective.

Are mount namespaces handled in this output?

> There's one other problem here: we're assuming that the map_files
> implementation doesn't have bugs.  If it does have bugs then relaxing
> permissions like this will create new vulnerabilities.  And the
> map_files implementation is surprisingly complex.  Is it bug-free?

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-27  0:19                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 80+ messages in thread
From: Kirill A. Shutemov @ 2015-01-27  0:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Calvin Owens, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api, Kees Cook

On Mon, Jan 26, 2015 at 03:43:46PM -0800, Andrew Morton wrote:
> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > > > is very useful for enumerating the files mapped into a process when
> > > > the more verbose information in /proc/<pid>/maps is not needed.
> 
> This is the main (actually only) justification for the patch, and it is
> far too thin.  What does "not needed" mean?  Why can't people just use
> /proc/pid/maps?
> 
> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> > > > the ability to ptrace the process in question, so this doesn't allow
> > > > an attacker to do anything they couldn't already do before.
> > > > 
> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> > > 
> > > Cc +linux-api@
> > 
> > Looks good to me, thanks! Though I would really appreciate if someone
> > from security camp take a look as well.
> 
> hm, who's that.  Kees comes to mind.
> 
> And reviewers' task would be a heck of a lot easier if they knew what
> /proc/pid/map_files actually does.  This:
> 
> akpm3:/usr/src/25> grep -r map_files Documentation 
> akpm3:/usr/src/25> 
> 
> does not help.
> 
> The 640708a2cff7f81 changelog says:
> 
> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> :     symlinks one for each mapping with file, the name of a symlink is
> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> :     results in a file that point exactly to the same inode as them vma's one.
> :     
> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> :     
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> afaict this info is also available in /proc/pid/maps, so things
> shouldn't get worse if the /proc/pid/map_files permissions are at least
> as restrictive as the /proc/pid/maps permissions.  Is that the case? 

Almost.

IIUC, until now we had no way to retrieve a file descriptor for a
mapped file once it had been closed and was no longer accessible for a
direct re-open, as in the chroot case or unlink after close.

I'm not sure what security implications this move has, if any. I don't see
anything obviously dangerous.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-27  6:46                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-27  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api, Kees Cook

On Mon, Jan 26, 2015 at 03:43:46PM -0800, Andrew Morton wrote:
> > 
> > Looks good to me, thanks! Though I would really appreciate if someone
> > from security camp take a look as well.
> 
> hm, who's that.  Kees comes to mind.

Yup, I managed to forget to CC him.

> 
> And reviewers' task would be a heck of a lot easier if they knew what
> /proc/pid/map_files actually does.  This:
> 
> akpm3:/usr/src/25> grep -r map_files Documentation 
> akpm3:/usr/src/25> 
> 
> does not help.

Sigh. Imagine, for some reason I thought we had docs for that
entry, probably because of the many fdinfo snippets I've put into
the /proc docs. My bad, sorry. I'll try to prepare the docs today.

> The 640708a2cff7f81 changelog says:
> 
> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> :     symlinks one for each mapping with file, the name of a symlink is
> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> :     results in a file that point exactly to the same inode as them vma's one.
> :     
> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> :     
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> afaict this info is also available in /proc/pid/maps, so things
> shouldn't get worse if the /proc/pid/map_files permissions are at least
> as restrictive as the /proc/pid/maps permissions.  Is that the case? 
> (Please add to changelog).
> 
> There's one other problem here: we're assuming that the map_files
> implementation doesn't have bugs.  If it does have bugs then relaxing
> permissions like this will create new vulnerabilities.  And the
> map_files implementation is surprisingly complex.  Is it bug-free?

I haven't found any bugs in map_files (and we have been using it for a
long time already), so I think it is safe.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-27  6:46                     ` Cyrill Gorcunov
  (?)
@ 2015-01-27  6:50                     ` Andrew Morton
  2015-01-27  7:23                         ` Cyrill Gorcunov
  -1 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-01-27  6:50 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api, Kees Cook

On Tue, 27 Jan 2015 09:46:47 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> > There's one other problem here: we're assuming that the map_files
> > implementation doesn't have bugs.  If it does have bugs then relaxing
> > permissions like this will create new vulnerabilities.  And the
> > map_files implementation is surprisingly complex.  Is it bug-free?
> 
> I didn't find any bugs in map-files (and we use it for long time already)
> so I think it is safe.

You've been using map_files the way it was supposed to be used so no,
any bugs won't show up.  What happens if you don your evil black hat
and use map_files in ways that weren't anticipated?  Attack it?


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-27  7:23                         ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-27  7:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan, Oleg Nesterov,
	Eric W. Biederman, Al Viro, Kirill A. Shutemov, Peter Feiner,
	Grant Likely, Siddhesh Poyarekar, linux-kernel, kernel-team,
	Pavel Emelyanov, linux-api, Kees Cook

On Mon, Jan 26, 2015 at 10:50:23PM -0800, Andrew Morton wrote:
> On Tue, 27 Jan 2015 09:46:47 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > > There's one other problem here: we're assuming that the map_files
> > > implementation doesn't have bugs.  If it does have bugs then relaxing
> > > permissions like this will create new vulnerabilities.  And the
> > > map_files implementation is surprisingly complex.  Is it bug-free?
> > 
> > I didn't find any bugs in map-files (and we use it for long time already)
> > so I think it is safe.
> 
> You've been using map_files the way it was supposed to be used so no,
> any bugs won't show up.  What happens if you don your evil black hat
> and use map_files in ways that weren't anticipated?  Attack it?

Hard to say, Andrew. If I found a way to exploit this feature for
bad purposes, I would certainly patch it out. At the moment I don't
see any. Touching another process's memory via a file descriptor
allows one to modify its contents, but you have to pass the
ptrace_may_access() check, which I consider sufficient for security.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-27  7:37                       ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-27  7:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, LKML,
	kernel-team, Pavel Emelyanov, Linux API

On Mon, Jan 26, 2015 at 04:15:26PM -0800, Kees Cook wrote:
> >
> > akpm3:/usr/src/25> grep -r map_files Documentation
> 
> If akpm's comments weren't clear: this needs to be fixed. Everything
> in /proc should appear in Documentation.

I'll do that.

> > The 640708a2cff7f81 changelog says:
> >
> > :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> > :     symlinks one for each mapping with file, the name of a symlink is
> > :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> > :     results in a file that point exactly to the same inode as them vma's one.
> > :
> > :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> > :
> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> How is mmap offset represented in this output?

We're printing the vm_area_struct range [vm_start; vm_end] only; the
mmap offset is not represented.
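
A minimal sketch of what that naming means for a consumer, assuming a
userspace tool that already holds an entry name from map_files/. The
helper name parse_map_files_name() is hypothetical, not a kernel or
libc interface. The name carries only the two addresses, so the file
offset of the mapping has to come from elsewhere, e.g. the offset
column of /proc/<pid>/maps:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a map_files entry name such as "7f8f80403000-7f8f80404000"
 * into its start and end addresses.
 * Returns 0 on success, -1 if the name is malformed.
 * (Hypothetical helper for illustration only.)
 */
static int parse_map_files_name(const char *name,
				unsigned long long *start,
				unsigned long long *end)
{
	char *p;

	assert(name && start && end);

	/* Each half must begin with a hex digit (no sign, no whitespace). */
	if (!isxdigit((unsigned char)*name))
		return -1;
	errno = 0;
	*start = strtoull(name, &p, 16);
	if (errno || *p != '-')
		return -1;

	name = p + 1;
	if (!isxdigit((unsigned char)*name))
		return -1;
	*end = strtoull(name, &p, 16);
	if (errno || *p != '\0')
		return -1;

	/* A mapping always covers a non-empty range. */
	return *start < *end ? 0 : -1;
}
```

Nothing else is encoded in the name, so two mappings of the same file
at different offsets can only be told apart by their address ranges.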

> > afacit this info is also available in /proc/pid/maps, so things
> > shouldn't get worse if the /proc/pid/map_files permissions are at least
> > as restrictive as the /proc/pid/maps permissions.  Is that the case?
> > (Please add to changelog).
> 
> Both maps and map_files uses ptrace_may_access (via mm_acces) with
> PTRACE_MODE_READ, so I'm happy from a info leak perspective.
> 
> Are mount namespaces handled in this output?

Could you clarify this point? I'm not sure I get it.

> 
> > There's one other problem here: we're assuming that the map_files
> > implementation doesn't have bugs.  If it does have bugs then relaxing
> > permissions like this will create new vulnerabilities.  And the
> > map_files implementation is surprisingly complex.  Is it bug-free?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-27  7:37                       ` Cyrill Gorcunov
@ 2015-01-27 19:53                         ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2015-01-27 19:53 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, LKML,
	kernel-team, Pavel Emelyanov, Linux API

On Mon, Jan 26, 2015 at 11:37 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Mon, Jan 26, 2015 at 04:15:26PM -0800, Kees Cook wrote:
>> >
>> > akpm3:/usr/src/25> grep -r map_files Documentation
>>
>> If akpm's comments weren't clear: this needs to be fixed. Everything
>> in /proc should appear in Documentation.
>
> I'll do that.
>
>> > The 640708a2cff7f81 changelog says:
>> >
>> > :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
>> > :     symlinks one for each mapping with file, the name of a symlink is
>> > :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
>> > :     results in a file that point exactly to the same inode as them vma's one.
>> > :
>> > :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
>> > :
>> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>> > :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
>>
>> How is mmap offset represented in this output?
>
> We're printing vm_area_struct:[vm_start;vm_end] only.
>
>> > afacit this info is also available in /proc/pid/maps, so things
>> > shouldn't get worse if the /proc/pid/map_files permissions are at least
>> > as restrictive as the /proc/pid/maps permissions.  Is that the case?
>> > (Please add to changelog).
>>
>> Both maps and map_files uses ptrace_may_access (via mm_acces) with
>> PTRACE_MODE_READ, so I'm happy from a info leak perspective.
>>
>> Are mount namespaces handled in this output?
>
> Could you clarify this moment, i'm not sure i get it.

I rephrased this question in my review of the documentation, but it
looks like these symlinks aren't "regular" symlinks (where it is up to
the follower to have access to the file system path shown); rather,
they bypass VFS. As a result, I'm wondering how things like mount
namespaces might change this behavior: which path is shown, the one
from the perspective of the target or the one from the viewer (the two
may be in separate mount namespaces)?

-Kees

>
>>
>> > There's one other problem here: we're assuming that the map_files
>> > implementation doesn't have bugs.  If it does have bugs then relaxing
>> > permissions like this will create new vulnerabilities.  And the
>> > map_files implementation is surprisingly complex.  Is it bug-free?



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-27 21:35                           ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-01-27 21:35 UTC (permalink / raw)
  To: Kees Cook, Pavel Emelyanov
  Cc: Andrew Morton, Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, LKML,
	kernel-team, Linux API

On Tue, Jan 27, 2015 at 11:53:19AM -0800, Kees Cook wrote:
> >>
> >> Are mount namespaces handled in this output?
> >
> > Could you clarify this moment, i'm not sure i get it.
> 
> I changed how I asked this question in my review of the documentation,
> but it looks like these symlinks aren't "regular" symlinks (that are
> up to the follower to have access to the file system path shown), but
> rather they bypass VFS. As a result, I'm wondering how things like
> mount namespaces might change this behavior: what is shown, the path
> from the perspective of the target, or from the viewer (which may be
> in separate mount namespaces).

I must admit I personally haven't investigated how mount namespaces
might interact with map_files. Pavel, could you share your thoughts?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-27 19:53                         ` Kees Cook
@ 2015-01-27 21:46                           ` Pavel Emelyanov
  -1 siblings, 0 replies; 80+ messages in thread
From: Pavel Emelyanov @ 2015-01-27 21:46 UTC (permalink / raw)
  To: Kees Cook, Cyrill Gorcunov
  Cc: Andrew Morton, Kirill A. Shutemov, Calvin Owens, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, LKML,
	kernel-team, Pavel Emelyanov, Linux API


>>> Are mount namespaces handled in this output?
>>
>> Could you clarify this moment, i'm not sure i get it.
> 
> I changed how I asked this question in my review of the documentation,
> but it looks like these symlinks aren't "regular" symlinks (that are
> up to the follower to have access to the file system path shown), but
> rather they bypass VFS. As a result, I'm wondering how things like
> mount namespaces might change this behavior: what is shown, the path
> from the perspective of the target, or from the viewer (which may be
> in separate mount namespaces).

These work just like the /proc/$pid/fd/$n links do. When you readlink()
one, d_path() is called, which walks up the dentry/vfsmnt tree until it
reaches either the current root or the global one. In the "another"
mount namespace case, it produces the path relative to that namespace's
root.

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-26 23:43                 ` Andrew Morton
@ 2015-01-28  4:38                     ` Calvin Owens
  2015-01-27  0:19                     ` Kirill A. Shutemov
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-28  4:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Kirill A. Shutemov, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov, linux-api, Kees Cook

On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> > > > is very useful for enumerating the files mapped into a process when
> > > > the more verbose information in /proc/<pid>/maps is not needed.
> 
> This is the main (actually only) justification for the patch, and it it
> far too thin.  What does "not needed" mean.  Why can't people just use
> /proc/pid/maps?

The biggest difference is that if you do something like this:

	fd = open("/stuff", O_BLAH);
	map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
	close(fd);
	unlink("/stuff");
 
...then map_files/ gives you a way to get a file descriptor for
"/stuff", which you couldn't do with /proc/pid/maps.
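
A runnable sketch of that sequence, under stated assumptions: a
temporary file under /tmp stands in for "/stuff", the placeholder
flags are filled in with read/write values, and the helper names
map_then_unlink() and reopen_mapping() are hypothetical. The reopen
step is expected to fail on kernels that do not expose map_files/ or
that restrict it to CAP_SYS_ADMIN, which is the situation this patch
is about:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of a fresh temporary file, then close the descriptor
 * and unlink the name, so the mapping is the only reference left.
 * Returns the mapping, or NULL on failure.
 */
static void *map_then_unlink(size_t len)
{
	char name[] = "/tmp/stuff-XXXXXX";	/* stand-in for "/stuff" */
	void *map;
	int fd;

	fd = mkstemp(name);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		unlink(name);
		return NULL;
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	unlink(name);
	return map == MAP_FAILED ? NULL : map;
}

/*
 * Try to obtain a new descriptor for the file behind the mapping via
 * /proc/self/map_files/<start>-<end>. Returns the descriptor, or -1 if
 * the entry is missing or the kernel refuses the open.
 */
static int reopen_mapping(void *map, size_t len)
{
	char path[96];
	unsigned long start = (unsigned long)map;
	int n;

	assert(map && len > 0);
	n = snprintf(path, sizeof(path), "/proc/self/map_files/%lx-%lx",
		     start, start + len);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return open(path, O_RDWR);
}
```

The entry name has to match the VMA boundaries exactly, so `len`
should be a multiple of the page size, e.g. sysconf(_SC_PAGESIZE).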

It's also something of a win if you just want to see what is mapped at a
specific address, since you can just readlink() the symlink for the
address range you care about and it will go grab the appropriate VMA and
give you the answer. /proc/pid/maps requires walking the VMA tree, which
is quite expensive for processes with many thousands of threads, even
without the O(N^2) issue.

(You have to know what address range you want though, since readdir() on
map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
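
As a sketch of the single-address-range lookup described above: the
helper names map_files_path() and mapped_file_at() are hypothetical,
and the caller is assumed to already know the exact start and end of
the mapping it cares about.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build "/proc/<pid>/map_files/<start>-<end>" in buf.
 * Returns the path length, or -1 if buf is too small.
 */
static int map_files_path(pid_t pid, unsigned long long start,
			  unsigned long long end, char *buf, size_t len)
{
	int n;

	assert(buf && len > 0);
	n = snprintf(buf, len, "/proc/%ld/map_files/%llx-%llx",
		     (long)pid, start, end);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Resolve the file mapped at exactly [start, end) in process `pid` with
 * a single readlink(), without reading /proc/<pid>/maps.
 * Returns the length of the NUL-terminated target, or -1 on failure
 * (no such mapping, no permission, or map_files/ not available).
 */
static ssize_t mapped_file_at(pid_t pid, unsigned long long start,
			      unsigned long long end,
			      char *target, size_t len)
{
	char path[96];
	ssize_t n;

	assert(target && len > 0);
	if (map_files_path(pid, start, end, path, sizeof(path)) < 0)
		return -1;
	n = readlink(path, target, len - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';	/* readlink() does not terminate the string */
	return n;
}
```

A range that does not match an existing mapping exactly simply fails,
which is why the boundaries have to be known in advance.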

> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> > > > the ability to ptrace the process in question, so this doesn't allow
> > > > an attacker to do anything they couldn't already do before.
> > > > 
> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> > > 
> > > Cc +linux-api@
> > 
> > Looks good to me, thanks! Though I would really appreciate if someone
> > from security camp take a look as well.
> 
> hm, who's that.  Kees comes to mind.
> 
> And reviewers' task would be a heck of a lot easier if they knew what
> /proc/pid/map_files actually does.  This:
> 
> akpm3:/usr/src/25> grep -r map_files Documentation 
> akpm3:/usr/src/25> 
> 
> does not help.
> 
> The 640708a2cff7f81 changelog says:
> 
> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> :     symlinks one for each mapping with file, the name of a symlink is
> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> :     results in a file that point exactly to the same inode as them vma's one.
> :     
> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> :     
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> afacit this info is also available in /proc/pid/maps, so things
> shouldn't get worse if the /proc/pid/map_files permissions are at least
> as restrictive as the /proc/pid/maps permissions.  Is that the case? 
> (Please add to changelog). 

Yes, the only difference is that you can follow the link, as described
above. I'll resend with a new changelog explaining that and the
unlinked-file case.
 
> There's one other problem here: we're assuming that the map_files
> implementation doesn't have bugs.  If it does have bugs then relaxing
> permissions like this will create new vulnerabilities.  And the
> map_files implementation is surprisingly complex.  Is it bug-free?

While I was messing with it I used it a good bit and didn't see any
issues, although I didn't actively try to fuzz it or anything. I'd be
happy to write something to test hammering it in weird ways if you like.
I'm also happy to write testcases for namespaces.

As for security issues: as others have pointed out, you can't follow
the links unless you can ptrace the process in question, which seems
like a pretty solid guarantee. As Cyrill pointed out in the discussion
about the documentation, that's the same protection as /proc/N/fd/*, and
those links function in the same way.

Thanks,
Calvin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-30  1:30                       ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2015-01-30  1:30 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
> On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
>> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>>
>> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
>> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
>> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
>> > > > is very useful for enumerating the files mapped into a process when
>> > > > the more verbose information in /proc/<pid>/maps is not needed.
>>
>> This is the main (actually only) justification for the patch, and it it
>> far too thin.  What does "not needed" mean.  Why can't people just use
>> /proc/pid/maps?
>
> The biggest difference is that if you do something like this:
>
>         fd = open("/stuff", O_BLAH);
>         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
>         close(fd);
>         unlink("/stuff");
>
> ...then map_files/ gives you a way to get a file descriptor for
> "/stuff", which you couldn't do with /proc/pid/maps.
>
> It's also something of a win if you just want to see what is mapped at a
> specific address, since you can just readlink() the symlink for the
> address range you care about and it will go grab the appropriate VMA and
> give you the answer. /proc/pid/maps requires walking the VMA tree, which
> is quite expensive for processes with many thousands of threads, even
> without the O(N^2) issue.
>
> (You have to know what address range you want though, since readdir() on
> map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
>
>> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
>> > > > the ability to ptrace the process in question, so this doesn't allow
>> > > > an attacker to do anything they couldn't already do before.
>> > > >
>> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
>> > >
>> > > Cc +linux-api@
>> >
>> > Looks good to me, thanks! Though I would really appreciate if someone
>> > from security camp take a look as well.
>>
>> hm, who's that.  Kees comes to mind.
>>
>> And reviewers' task would be a heck of a lot easier if they knew what
>> /proc/pid/map_files actually does.  This:
>>
>> akpm3:/usr/src/25> grep -r map_files Documentation
>> akpm3:/usr/src/25>
>>
>> does not help.
>>
>> The 640708a2cff7f81 changelog says:
>>
>> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
>> :     symlinks one for each mapping with file, the name of a symlink is
>> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
>> :     results in a file that point exactly to the same inode as them vma's one.
>> :
>> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
>> :
>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
>>
>> afacit this info is also available in /proc/pid/maps, so things
>> shouldn't get worse if the /proc/pid/map_files permissions are at least
>> as restrictive as the /proc/pid/maps permissions.  Is that the case?
>> (Please add to changelog).
>
> Yes, the only difference is that you can follow the link as per above.
> I'll resend with a new message explaining that and the deletion thing.
>
>> There's one other problem here: we're assuming that the map_files
>> implementation doesn't have bugs.  If it does have bugs then relaxing
>> permissions like this will create new vulnerabilities.  And the
>> map_files implementation is surprisingly complex.  Is it bug-free?
>
> While I was messing with it I used it a good bit and didn't see any
> issues, although I didn't actively try to fuzz it or anything. I'd be
> happy to write something to test hammering it in weird ways if you like.
> I'm also happy to write testcases for namespaces.
>
> So far as security issues, as others have pointed out you can't follow
> the links unless you can ptrace the process in question, which seems
> like a pretty solid guarantee. As Cyrill pointed out in the discussion
> about the documentation, that's the same protection as /proc/N/fd/*, and
> those links function in the same way.

My concern here is that fd/* are connected as streams, and while that
has a certain level of badness when used by an attacker external to the
process, PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which
is required for access to /proc/N/mem). Since these fds are the things
mapped into a process's memory, writing to them is a subset of access
to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-01-31  1:58                         ` Calvin Owens
  0 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-01-31  1:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
> > On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >>
> >> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >> > > > is very useful for enumerating the files mapped into a process when
> >> > > > the more verbose information in /proc/<pid>/maps is not needed.
> >>
> >> This is the main (actually only) justification for the patch, and it is
> >> far too thin.  What does "not needed" mean.  Why can't people just use
> >> /proc/pid/maps?
> >
> > The biggest difference is that if you do something like this:
> >
> >         fd = open("/stuff", O_BLAH);
> >         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >         close(fd);
> >         unlink("/stuff");
> >
> > ...then map_files/ gives you a way to get a file descriptor for
> > "/stuff", which you couldn't do with /proc/pid/maps.
> >
> > It's also something of a win if you just want to see what is mapped at a
> > specific address, since you can just readlink() the symlink for the
> > address range you care about and it will go grab the appropriate VMA and
> > give you the answer. /proc/pid/maps requires walking the VMA tree, which
> > is quite expensive for processes with many thousands of threads, even
> > without the O(N^2) issue.
> >
> > (You have to know what address range you want though, since readdir() on
> > map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >
> >> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >> > > > the ability to ptrace the process in question, so this doesn't allow
> >> > > > an attacker to do anything they couldn't already do before.
> >> > > >
> >> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> >> > >
> >> > > Cc +linux-api@
> >> >
> >> > Looks good to me, thanks! Though I would really appreciate if someone
> >> > from security camp take a look as well.
> >>
> >> hm, who's that.  Kees comes to mind.
> >>
> >> And reviewers' task would be a heck of a lot easier if they knew what
> >> /proc/pid/map_files actually does.  This:
> >>
> >> akpm3:/usr/src/25> grep -r map_files Documentation
> >> akpm3:/usr/src/25>
> >>
> >> does not help.
> >>
> >> The 640708a2cff7f81 changelog says:
> >>
> >> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >> :     symlinks one for each mapping with file, the name of a symlink is
> >> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> >> :     results in a file that point exactly to the same inode as them vma's one.
> >> :
> >> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >> :
> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >>
> >> afacit this info is also available in /proc/pid/maps, so things
> >> shouldn't get worse if the /proc/pid/map_files permissions are at least
> >> as restrictive as the /proc/pid/maps permissions.  Is that the case?
> >> (Please add to changelog).
> >
> > Yes, the only difference is that you can follow the link as per above.
> > I'll resend with a new message explaining that and the deletion thing.
> >
> >> There's one other problem here: we're assuming that the map_files
> >> implementation doesn't have bugs.  If it does have bugs then relaxing
> >> permissions like this will create new vulnerabilities.  And the
> >> map_files implementation is surprisingly complex.  Is it bug-free?
> >
> > While I was messing with it I used it a good bit and didn't see any
> > issues, although I didn't actively try to fuzz it or anything. I'd be
> > happy to write something to test hammering it in weird ways if you like.
> > I'm also happy to write testcases for namespaces.
> >
> > So far as security issues, as others have pointed out you can't follow
> > the links unless you can ptrace the process in question, which seems
> > like a pretty solid guarantee. As Cyrill pointed out in the discussion
> > about the documentation, that's the same protection as /proc/N/fd/*, and
> > those links function in the same way.
> 
> My concern here is that fd/* are connected as streams, and while that
> has a certain level of badness as an external-to-the-process attacker,
> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> required for access to /proc/N/mem). Since these fds are the things
> mapped into memory on a process, writing to them is a subset of access
> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.

If you haven't done close() on a mmapped file, doesn't fd/* allow the
same access to the corresponding regions of memory? Or am I missing
something?
 
But that said, I can't think of any reason making it MODE_ATTACH would
be a problem. Would you rather that be enforced on follow_link() like
the original patch did, or enforce it for the whole directory?

Thanks,
Calvin
 
> -Kees
> 
> -- 
> Kees Cook
> Chrome OS Security
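
The readlink() lookup described in the quoted text above ("you can just
readlink() the symlink for the address range you care about") can be
sketched from userspace roughly as follows. This is a minimal
illustration, not code from the patch: the helper names map_files_path()
and map_files_target() are hypothetical, and the only detail taken from
the thread is the entry name format, "start-end" in hex, as shown in the
quoted 640708a2cff7f81 changelog. Whether the lookup succeeds depends on
the kernel version, its configuration, and the caller's privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "/proc/<pid>/map_files/<start>-<end>".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int map_files_path(char *buf, size_t size, long pid,
                   unsigned long long start, unsigned long long end)
{
    assert(buf != NULL);
    int n = snprintf(buf, size, "/proc/%ld/map_files/%llx-%llx",
                     pid, start, end);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Hypothetical helper: resolve the file backing one known address range.
 * Stores a NUL-terminated pathname in target and returns its length, or
 * returns -1 if the entry does not exist or cannot be read.
 */
ssize_t map_files_target(long pid, unsigned long long start,
                         unsigned long long end, char *target, size_t size)
{
    char link[96];

    if (target == NULL || size < 2)
        return -1;
    if (map_files_path(link, sizeof link, pid, start, end) != 0)
        return -1;

    /* readlink() does not NUL-terminate, so reserve one byte for that. */
    ssize_t n = readlink(link, target, size - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return n;
}
```

The caller has to know the exact VMA boundaries in advance, which matches
the caveat in the quoted text: finding them in the first place still
means reading the directory or /proc/<pid>/maps.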

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-31  1:58                         ` Calvin Owens
  (?)
@ 2015-02-02 14:01                         ` Austin S Hemmelgarn
  2015-02-04  3:53                             ` Calvin Owens
  -1 siblings, 1 reply; 80+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-02 14:01 UTC (permalink / raw)
  To: Calvin Owens, Kees Cook
  Cc: Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On 2015-01-30 20:58, Calvin Owens wrote:
> On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
>> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
>>> On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
>>>> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>>>>
>>>>> On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
>>>>>> On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
>>>>>>> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>>>>>>> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
>>>>>>> is very useful for enumerating the files mapped into a process when
>>>>>>> the more verbose information in /proc/<pid>/maps is not needed.
>>>>
>>>> This is the main (actually only) justification for the patch, and it is
>>>> far too thin.  What does "not needed" mean.  Why can't people just use
>>>> /proc/pid/maps?
>>>
>>> The biggest difference is that if you do something like this:
>>>
>>>          fd = open("/stuff", O_BLAH);
>>>          map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
>>>          close(fd);
>>>          unlink("/stuff");
>>>
>>> ...then map_files/ gives you a way to get a file descriptor for
>>> "/stuff", which you couldn't do with /proc/pid/maps.
>>>
>>> It's also something of a win if you just want to see what is mapped at a
>>> specific address, since you can just readlink() the symlink for the
>>> address range you care about and it will go grab the appropriate VMA and
>>> give you the answer. /proc/pid/maps requires walking the VMA tree, which
>>> is quite expensive for processes with many thousands of threads, even
>>> without the O(N^2) issue.
>>>
>>> (You have to know what address range you want though, since readdir() on
>>> map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
>>>
>>>>>>> This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>>>>>>> removes the CAP_SYS_ADMIN restrictions. Following the links requires
>>>>>>> the ability to ptrace the process in question, so this doesn't allow
>>>>>>> an attacker to do anything they couldn't already do before.
>>>>>>>
>>>>>>> Signed-off-by: Calvin Owens <calvinowens@fb.com>
>>>>>>
>>>>>> Cc +linux-api@
>>>>>
>>>>> Looks good to me, thanks! Though I would really appreciate if someone
>>>>> from security camp take a look as well.
>>>>
>>>> hm, who's that.  Kees comes to mind.
>>>>
>>>> And reviewers' task would be a heck of a lot easier if they knew what
>>>> /proc/pid/map_files actually does.  This:
>>>>
>>>> akpm3:/usr/src/25> grep -r map_files Documentation
>>>> akpm3:/usr/src/25>
>>>>
>>>> does not help.
>>>>
>>>> The 640708a2cff7f81 changelog says:
>>>>
>>>> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
>>>> :     symlinks one for each mapping with file, the name of a symlink is
>>>> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
>>>> :     results in a file that point exactly to the same inode as them vma's one.
>>>> :
>>>> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
>>>> :
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
>>>>
>>>> afacit this info is also available in /proc/pid/maps, so things
>>>> shouldn't get worse if the /proc/pid/map_files permissions are at least
>>>> as restrictive as the /proc/pid/maps permissions.  Is that the case?
>>>> (Please add to changelog).
>>>
>>> Yes, the only difference is that you can follow the link as per above.
>>> I'll resend with a new message explaining that and the deletion thing.
>>>
>>>> There's one other problem here: we're assuming that the map_files
>>>> implementation doesn't have bugs.  If it does have bugs then relaxing
>>>> permissions like this will create new vulnerabilities.  And the
>>>> map_files implementation is surprisingly complex.  Is it bug-free?
>>>
>>> While I was messing with it I used it a good bit and didn't see any
>>> issues, although I didn't actively try to fuzz it or anything. I'd be
>>> happy to write something to test hammering it in weird ways if you like.
>>> I'm also happy to write testcases for namespaces.
>>>
>>> So far as security issues, as others have pointed out you can't follow
>>> the links unless you can ptrace the process in question, which seems
>>> like a pretty solid guarantee. As Cyrill pointed out in the discussion
>>> about the documentation, that's the same protection as /proc/N/fd/*, and
>>> those links function in the same way.
>>
>> My concern here is that fd/* are connected as streams, and while that
>> has a certain level of badness as an external-to-the-process attacker,
>> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
>> required for access to /proc/N/mem). Since these fds are the things
>> mapped into memory on a process, writing to them is a subset of access
>> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
>
> If you haven't done close() on a mmapped file, doesn't fd/* allow the
> same access to the corresponding regions of memory? Or am I missing
> something?
>
> But that said, I can't think of any reason making it MODE_ATTACH would
> be a problem. Would you rather that be enforced on follow_link() like
> the original patch did, or enforce it for the whole directory?
>
Whole directory would probably be better, as even just the mapped ranges 
could be considered sensitive information.  Ideally, the check should be 
done on both follow_link(), and the directory itself.
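
To make the point about the ranges themselves concrete: simply listing
the directory already reveals the layout of the address space, before
any link is followed. The sketch below is an assumption-laden userspace
illustration, not kernel code; parse_map_files_name() and
list_own_mapped_ranges() are hypothetical names, and the "start-end" hex
name format is the one given in the quoted changelog.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>

/* Consume one run of hex digits; returns the first unparsed character,
 * or NULL if there were no digits or the value overflowed. */
static const char *parse_hex(const char *s, unsigned long long *out)
{
    unsigned long long v = 0;
    const char *p = s;

    while (isxdigit((unsigned char)*p)) {
        int d = isdigit((unsigned char)*p)
                    ? *p - '0'
                    : tolower((unsigned char)*p) - 'a' + 10;
        if (v > (ULLONG_MAX >> 4))
            return NULL;
        v = (v << 4) | (unsigned long long)d;
        p++;
    }
    if (p == s)
        return NULL;
    *out = v;
    return p;
}

/* Hypothetical helper: split a map_files entry name such as
 * "7f8f80403000-7f8f80404000" into its bounds.
 * Returns 0 on success, -1 if the name is malformed. */
int parse_map_files_name(const char *name, unsigned long long *start,
                         unsigned long long *end)
{
    assert(name != NULL && start != NULL && end != NULL);
    unsigned long long s, e;

    const char *p = parse_hex(name, &s);
    if (p == NULL || *p != '-')
        return -1;
    p = parse_hex(p + 1, &e);
    if (p == NULL || *p != '\0' || s >= e)
        return -1;
    *start = s;
    *end = e;
    return 0;
}

/* Print every range listed for the calling process. Returns the number
 * of entries, or -1 if the directory cannot be opened at all. */
int list_own_mapped_ranges(void)
{
    DIR *d = opendir("/proc/self/map_files");
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        unsigned long long s, e;
        /* "." and ".." fail to parse and are skipped. */
        if (parse_map_files_name(de->d_name, &s, &e) == 0) {
            printf("%llx-%llx (%llu bytes)\n", s, e, e - s);
            count++;
        }
    }
    closedir(d);
    return count;
}
```

Nothing here touches the link targets, so a permission check placed only
on follow_link() would not stop a caller from collecting this list.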




^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-31  1:58                         ` Calvin Owens
  (?)
  (?)
@ 2015-02-02 20:16                         ` Andy Lutomirski
  2015-02-04  3:28                             ` Calvin Owens
  -1 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2015-02-02 20:16 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On Fri, Jan 30, 2015 at 5:58 PM, Calvin Owens <calvinowens@fb.com> wrote:
> On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
>> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
>> > On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
>> >> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> >>
>> >> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
>> >> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
>> >> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> >> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
>> >> > > > is very useful for enumerating the files mapped into a process when
>> >> > > > the more verbose information in /proc/<pid>/maps is not needed.
>> >>
>> >> This is the main (actually only) justification for the patch, and it is
>> >> far too thin.  What does "not needed" mean.  Why can't people just use
>> >> /proc/pid/maps?
>> >
>> > The biggest difference is that if you do something like this:
>> >
>> >         fd = open("/stuff", O_BLAH);
>> >         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
>> >         close(fd);
>> >         unlink("/stuff");
>> >
>> > ...then map_files/ gives you a way to get a file descriptor for
>> > "/stuff", which you couldn't do with /proc/pid/maps.
>> >
>> > It's also something of a win if you just want to see what is mapped at a
>> > specific address, since you can just readlink() the symlink for the
>> > address range you care about and it will go grab the appropriate VMA and
>> > give you the answer. /proc/pid/maps requires walking the VMA tree, which
>> > is quite expensive for processes with many thousands of threads, even
>> > without the O(N^2) issue.
>> >
>> > (You have to know what address range you want though, since readdir() on
>> > map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
>> >
>> >> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>> >> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
>> >> > > > the ability to ptrace the process in question, so this doesn't allow
>> >> > > > an attacker to do anything they couldn't already do before.
>> >> > > >
>> >> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
>> >> > >
>> >> > > Cc +linux-api@
>> >> >
>> >> > Looks good to me, thanks! Though I would really appreciate if someone
>> >> > from security camp take a look as well.
>> >>
>> >> hm, who's that.  Kees comes to mind.
>> >>
>> >> And reviewers' task would be a heck of a lot easier if they knew what
>> >> /proc/pid/map_files actually does.  This:
>> >>
>> >> akpm3:/usr/src/25> grep -r map_files Documentation
>> >> akpm3:/usr/src/25>
>> >>
>> >> does not help.
>> >>
>> >> The 640708a2cff7f81 changelog says:
>> >>
>> >> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
>> >> :     symlinks one for each mapping with file, the name of a symlink is
>> >> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
>> >> :     results in a file that point exactly to the same inode as them vma's one.
>> >> :
>> >> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
>> >> :
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
>> >>
>> >> afacit this info is also available in /proc/pid/maps, so things
>> >> shouldn't get worse if the /proc/pid/map_files permissions are at least
>> >> as restrictive as the /proc/pid/maps permissions.  Is that the case?
>> >> (Please add to changelog).
>> >
>> > Yes, the only difference is that you can follow the link as per above.
>> > I'll resend with a new message explaining that and the deletion thing.
>> >
>> >> There's one other problem here: we're assuming that the map_files
>> >> implementation doesn't have bugs.  If it does have bugs then relaxing
>> >> permissions like this will create new vulnerabilities.  And the
>> >> map_files implementation is surprisingly complex.  Is it bug-free?
>> >
>> > While I was messing with it I used it a good bit and didn't see any
>> > issues, although I didn't actively try to fuzz it or anything. I'd be
>> > happy to write something to test hammering it in weird ways if you like.
>> > I'm also happy to write testcases for namespaces.
>> >
>> > So far as security issues, as others have pointed out you can't follow
>> > the links unless you can ptrace the process in question, which seems
>> > like a pretty solid guarantee. As Cyrill pointed out in the discussion
>> > about the documentation, that's the same protection as /proc/N/fd/*, and
>> > those links function in the same way.
>>
>> My concern here is that fd/* are connected as streams, and while that
>> has a certain level of badness as an external-to-the-process attacker,
>> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
>> required for access to /proc/N/mem). Since these fds are the things
>> mapped into memory on a process, writing to them is a subset of access
>> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
>
> If you haven't done close() on a mmapped file, doesn't fd/* allow the
> same access to the corresponding regions of memory? Or am I missing
> something?
>

But if you have called close(), then you can't currently do things
like ftruncate or ioctl on the mapped file.  These things don't
persist across execve(), but they do persist across calls to setresuid,
etc that drop privileges.  The latter part makes me a tiny bit
nervous.

It also might be worth checking for drivers or arch code that creates
vmas that are backed by a different struct file than the struct file
that was mmapped in the first place.

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-02-04  3:28                             ` Calvin Owens
  0 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-02-04  3:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On Monday 02/02 at 12:16 -0800, Andy Lutomirski wrote:
> On Fri, Jan 30, 2015 at 5:58 PM, Calvin Owens <calvinowens@fb.com> wrote:
> > On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> >> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
> >> > On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >> >> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >> >>
> >> >> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >> >> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >> >> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >> >> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >> >> > > > is very useful for enumerating the files mapped into a process when
> >> >> > > > the more verbose information in /proc/<pid>/maps is not needed.
> >> >>
> >> >> This is the main (actually only) justification for the patch, and it is
> >> >> far too thin.  What does "not needed" mean.  Why can't people just use
> >> >> /proc/pid/maps?
> >> >
> >> > The biggest difference is that if you do something like this:
> >> >
> >> >         fd = open("/stuff", O_BLAH);
> >> >         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >> >         close(fd);
> >> >         unlink("/stuff");
> >> >
> >> > ...then map_files/ gives you a way to get a file descriptor for
> >> > "/stuff", which you couldn't do with /proc/pid/maps.
> >> >
> >> > It's also something of a win if you just want to see what is mapped at a
> >> > specific address, since you can just readlink() the symlink for the
> >> > address range you care about and it will go grab the appropriate VMA and
> >> > give you the answer. /proc/pid/maps requires walking the VMA tree, which
> >> > is quite expensive for processes with many thousands of threads, even
> >> > without the O(N^2) issue.
> >> >
> >> > (You have to know what address range you want though, since readdir() on
> >> > map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >> >
> >> >> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >> >> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >> >> > > > the ability to ptrace the process in question, so this doesn't allow
> >> >> > > > an attacker to do anything they couldn't already do before.
> >> >> > > >
> >> >> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
> >> >> > >
> >> >> > > Cc +linux-api@
> >> >> >
> >> >> > Looks good to me, thanks! Though I would really appreciate if someone
> >> >> > from security camp take a look as well.
> >> >>
> >> >> hm, who's that.  Kees comes to mind.
> >> >>
> >> >> And reviewers' task would be a heck of a lot easier if they knew what
> >> >> /proc/pid/map_files actually does.  This:
> >> >>
> >> >> akpm3:/usr/src/25> grep -r map_files Documentation
> >> >> akpm3:/usr/src/25>
> >> >>
> >> >> does not help.
> >> >>
> >> >> The 640708a2cff7f81 changelog says:
> >> >>
> >> >> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >> >> :     symlinks one for each mapping with file, the name of a symlink is
> >> >> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> >> >> :     results in a file that point exactly to the same inode as them vma's one.
> >> >> :
> >> >> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >> >> :
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >> >>
> >> >> afaict this info is also available in /proc/pid/maps, so things
> >> >> shouldn't get worse if the /proc/pid/map_files permissions are at least
> >> >> as restrictive as the /proc/pid/maps permissions.  Is that the case?
> >> >> (Please add to changelog).
> >> >
> >> > Yes, the only difference is that you can follow the link as per above.
> >> > I'll resend with a new message explaining that and the deletion thing.
> >> >
> >> >> There's one other problem here: we're assuming that the map_files
> >> >> implementation doesn't have bugs.  If it does have bugs then relaxing
> >> >> permissions like this will create new vulnerabilities.  And the
> >> >> map_files implementation is surprisingly complex.  Is it bug-free?
> >> >
> >> > While I was messing with it I used it a good bit and didn't see any
> >> > issues, although I didn't actively try to fuzz it or anything. I'd be
> >> > happy to write something to test hammering it in weird ways if you like.
> >> > I'm also happy to write testcases for namespaces.
> >> >
> >> > So far as security issues, as others have pointed out you can't follow
> >> > the links unless you can ptrace the process in question, which seems
> >> > like a pretty solid guarantee. As Cyrill pointed out in the discussion
> >> > about the documentation, that's the same protection as /proc/N/fd/*, and
> >> > those links function in the same way.
> >>
> >> My concern here is that fd/* are connected as streams, and while that
> >> has a certain level of badness as an external-to-the-process attacker,
> >> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> >> required for access to /proc/N/mem). Since these fds are the things
> >> mapped into memory on a process, writing to them is a subset of access
> >> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
> >
> > If you haven't done close() on a mmapped file, doesn't fd/* allow the
> > same access to the corresponding regions of memory? Or am I missing
> > something?
> >
> 
> But if you have called close(), then you can't currently do things
> like ftruncate or ioctl on the mapped file.  These things don't
> persist across execve(), but they do persist across calls to setresuid,
> etc that drop privileges.  The latter part makes me a tiny bit
> nervous.

Hmm, in that scenario you would have to open() the map_files symlink,
and since you've dropped privileges that would only succeed if the user
you dropped to has permission to access that file anyway, right? 
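
As a rough userspace sketch of what "open() the map_files symlink" means
in practice: the link name is the VMA's start and end address in hex, as
in the ls output quoted above. The helper name map_files_path() is made
up for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Build "/proc/<pid>/map_files/<start>-<end>" into buf.
 * Hypothetical helper: the "%lx-%lx" name format is inferred from the
 * changelog and ls output quoted in this thread.
 * Returns the path length, or -1 if buf is too small.
 */
static int map_files_path(char *buf, size_t size, pid_t pid,
			  unsigned long start, unsigned long end)
{
	int n;

	assert(buf && size > 0);
	n = snprintf(buf, size, "/proc/%d/map_files/%lx-%lx",
		     (int)pid, start, end);
	if (n < 0 || (size_t)n >= size)
		return -1;

	return n;
}
```

A caller would hand the resulting path to open(), which is the point
where the ordinary permission check against the target file applies.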

In the deleted file case it does actually allow something that used to
be impossible, but relying on open/map/close/unlink to prevent a user
from opening a file they have permission to open is just buggy in
general.

But, O_TMPFILE lets you end up in that position without the race. The
manpage says that O_TMPFILE files "can never be reached via any
pathname", which isn't strictly true since you can get them from fd/* in
proc. But if you close() after mapping it they are currently truly
inaccessible via any path, and given the language in the manpage it
seems reasonable that somebody might rely on that and be lazy with the
permissions.

I hadn't thought about the O_TMPFILE thing: I'm definitely convinced now
that PTRACE_MODE_ATTACH is the right thing here. But I think having to
reopen the file saves you even if you "leak" maps of files across a call
to setresuid/etc. 
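
For reference, the open/map/close/unlink sequence under discussion looks
roughly like this in userspace. This is only a sketch: map_close_unlink()
is a hypothetical helper, and the flags and mode are assumptions rather
than anything taken from a real application.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path`, write `msg` at offset 0, map `len` bytes MAP_SHARED,
 * then close() the descriptor and unlink() the name.
 * Returns the mapping, or NULL on failure.
 */
static char *map_close_unlink(const char *path, const char *msg, size_t len)
{
	size_t msg_len;
	char *map;
	int fd;

	assert(path && msg);
	msg_len = strlen(msg);
	if (msg_len > len)
		return NULL;

	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return NULL;

	if (ftruncate(fd, (off_t)len) < 0 ||
	    write(fd, msg, msg_len) != (ssize_t)msg_len) {
		close(fd);
		unlink(path);
		return NULL;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Drop both the descriptor and the name; the mapping keeps the inode. */
	close(fd);
	unlink(path);

	return map == MAP_FAILED ? NULL : map;
}
```

After it returns, the pathname is gone and no descriptor is open, yet
the mapping still pins the inode. That is the case where map_files/
would newly offer a way back to the file.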

> It also might be worth checking for drivers or arch code that creates
> vmas that are backed by a different struct file than the struct file
> that was mmapped in the first place.

Interesting, I'll look into this before I resend.

Thanks,
Calvin
 
> --Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
@ 2015-02-04  3:53                             ` Calvin Owens
  0 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-02-04  3:53 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov,
	Linux API

On Monday 02/02 at 09:01 -0500, Austin S Hemmelgarn wrote:
> On 2015-01-30 20:58, Calvin Owens wrote:
> >On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
> >>On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
> >>>On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
> >>>>On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >>>>
> >>>>>On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
> >>>>>>On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
> >>>>>>>Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >>>>>>>is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
> >>>>>>>is very useful for enumerating the files mapped into a process when
> >>>>>>>the more verbose information in /proc/<pid>/maps is not needed.
> >>>>
> >>>>This is the main (actually only) justification for the patch, and it is
> >>>>far too thin.  What does "not needed" mean.  Why can't people just use
> >>>>/proc/pid/maps?
> >>>
> >>>The biggest difference is that if you do something like this:
> >>>
> >>>         fd = open("/stuff", O_BLAH);
> >>>         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
> >>>         close(fd);
> >>>         unlink("/stuff");
> >>>
> >>>...then map_files/ gives you a way to get a file descriptor for
> >>>"/stuff", which you couldn't do with /proc/pid/maps.
> >>>
> >>>It's also something of a win if you just want to see what is mapped at a
> >>>specific address, since you can just readlink() the symlink for the
> >>>address range you care about and it will go grab the appropriate VMA and
> >>>give you the answer. /proc/pid/maps requires walking the VMA tree, which
> >>>is quite expensive for processes with many thousands of threads, even
> >>>without the O(N^2) issue.
> >>>
> >>>(You have to know what address range you want though, since readdir() on
> >>>map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
> >>>
> >>>>>>>This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> >>>>>>>removes the CAP_SYS_ADMIN restrictions. Following the links requires
> >>>>>>>the ability to ptrace the process in question, so this doesn't allow
> >>>>>>>an attacker to do anything they couldn't already do before.
> >>>>>>>
> >>>>>>>Signed-off-by: Calvin Owens <calvinowens@fb.com>
> >>>>>>
> >>>>>>Cc +linux-api@
> >>>>>
> >>>>>Looks good to me, thanks! Though I would really appreciate if someone
> >>>>>from security camp take a look as well.
> >>>>
> >>>>hm, who's that.  Kees comes to mind.
> >>>>
> >>>>And reviewers' task would be a heck of a lot easier if they knew what
> >>>>/proc/pid/map_files actually does.  This:
> >>>>
> >>>>akpm3:/usr/src/25> grep -r map_files Documentation
> >>>>akpm3:/usr/src/25>
> >>>>
> >>>>does not help.
> >>>>
> >>>>The 640708a2cff7f81 changelog says:
> >>>>
> >>>>:     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
> >>>>:     symlinks one for each mapping with file, the name of a symlink is
> >>>>:     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
> >>>>:     results in a file that point exactly to the same inode as them vma's one.
> >>>>:
> >>>>:     For example the ls -l of some arbitrary /proc/<pid>/map_files/
> >>>>:
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
> >>>>:      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> >>>>
> >>>>afaict this info is also available in /proc/pid/maps, so things
> >>>>shouldn't get worse if the /proc/pid/map_files permissions are at least
> >>>>as restrictive as the /proc/pid/maps permissions.  Is that the case?
> >>>>(Please add to changelog).
> >>>
> >>>Yes, the only difference is that you can follow the link as per above.
> >>>I'll resend with a new message explaining that and the deletion thing.
> >>>
> >>>>There's one other problem here: we're assuming that the map_files
> >>>>implementation doesn't have bugs.  If it does have bugs then relaxing
> >>>>permissions like this will create new vulnerabilities.  And the
> >>>>map_files implementation is surprisingly complex.  Is it bug-free?
> >>>
> >>>While I was messing with it I used it a good bit and didn't see any
> >>>issues, although I didn't actively try to fuzz it or anything. I'd be
> >>>happy to write something to test hammering it in weird ways if you like.
> >>>I'm also happy to write testcases for namespaces.
> >>>
> >>>So far as security issues, as others have pointed out you can't follow
> >>>the links unless you can ptrace the process in question, which seems
> >>>like a pretty solid guarantee. As Cyrill pointed out in the discussion
> >>>about the documentation, that's the same protection as /proc/N/fd/*, and
> >>>those links function in the same way.
> >>
> >>My concern here is that fd/* are connected as streams, and while that
> >>has a certain level of badness as an external-to-the-process attacker,
> >>PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
> >>required for access to /proc/N/mem). Since these fds are the things
> >>mapped into memory on a process, writing to them is a subset of access
> >>to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
> >
> >If you haven't done close() on a mmapped file, doesn't fd/* allow the
> >same access to the corresponding regions of memory? Or am I missing
> >something?
> >
> >But that said, I can't think of any reason making it MODE_ATTACH would
> >be a problem. Would you rather that be enforced on follow_link() like
> >the original patch did, or enforce it for the whole directory?
> >
> >
> Whole directory would probably be better, as even just the mapped
> ranges could be considered sensitive information. 

You can already get the ranges that are mapped from /proc/N/maps with
PTRACE_MODE_READ, so that part isn't new information.
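
For illustration, here is a minimal sketch of pulling the range and
pathname out of one line of /proc/N/maps. parse_maps_line() and struct
maps_entry are hypothetical names, and the sample line in the comment
uses made-up perms, offset, device and inode values.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct maps_entry {
	unsigned long start;
	unsigned long end;
	char path[256];		/* empty for mappings without a name */
};

/*
 * Parse one maps line of the form
 *   7f8f80403000-7f8f80404000 r--p 00153000 08:01 1234 /lib64/libc-2.5.so
 * Returns 0 on success, -1 if the address range cannot be parsed.
 */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
	int off = 0;
	size_t n;

	assert(line && e);
	e->path[0] = '\0';

	/* %n records where the pathname starts; it stays 0 if never reached. */
	if (sscanf(line, "%lx-%lx %*4s %*x %*x:%*x %*u %n",
		   &e->start, &e->end, &off) < 2)
		return -1;

	if (off > 0) {
		n = strcspn(line + off, "\n");
		if (n >= sizeof(e->path))
			n = sizeof(e->path) - 1;
		memcpy(e->path, line + off, n);
		e->path[n] = '\0';
	}

	return 0;
}
```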

> Ideally, the check should be done on both follow_link(), and the
> directory itself.

Oh, I didn't mean restricting readdir(), I meant restricting any access
through the directory similar to how the original CAP_SYS_ADMIN check
was done.
 
Thanks,
Calvin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [RFC][PATCH v3] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-01-24  3:15           ` [RFC][PATCH v2] " Calvin Owens
  2015-01-26 12:47             ` Kirill A. Shutemov
@ 2015-02-12  2:29             ` Calvin Owens
  2015-02-12  7:45               ` Cyrill Gorcunov
                                 ` (2 more replies)
  1 sibling, 3 replies; 80+ messages in thread
From: Calvin Owens @ 2015-02-12  2:29 UTC (permalink / raw)
  To: Cyrill Gorcunov, Kirill A. Shutemov, Andrew Morton
  Cc: Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team, Pavel Emelyanov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
useful for enumerating the files mapped into a process when the more
verbose information in /proc/<pid>/maps is not needed. It also allows
access to file descriptors for files that have been deleted and closed
but are still mmapped into a process, which can be very useful for
introspection and debugging.

This patch moves the folder out from behind CHECKPOINT_RESTORE, and
removes the CAP_SYS_ADMIN restrictions. With that change alone,
accessing this interface would have required PTRACE_MODE_READ like the
links in /proc/<pid>/fd/*.

However, a discussion on lkml concluded that MODE_READ is not
sufficient, both because write access to the inodes these links point
to allows direct modification of a process's address space, and
because it exposes files that users may have overlooked permissions on
because it was assumed they would be inaccessible (either deleted as
per above, or created via O_TMPFILE).

So, in addition to the above, this patch enforces PTRACE_MODE_ATTACH on
all the map_files/ operations. Since this is the same check that
determines if access to /proc/<pid>/mem is allowed, it will not allow an
attacker to do anything that was not already possible through that
interface.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
Changes in v3:	Changed permission checks to use PTRACE_MODE_ATTACH
		instead of PTRACE_MODE_READ, and added a stub to
		enforce MODE_ATTACH on follow_link() as well.

Changes in v2:	Removed the follow_link() stub that returned -EPERM if
		the caller didn't have CAP_SYS_ADMIN, since the caller
		in my chroot() scenario gets -EACCES anyway.
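
For readers following along: each entry under map_files/ is named after
its VMA range in the "%lx-%lx" form noted next to struct map_files_info.
Below is a minimal userspace sketch of building and parsing such names.
It is not part of the patch. The helper names parse_map_files_name() and
format_map_files_name() and the MAP_FILES_NAME_MAX macro are hypothetical,
and the start < end sanity check is an assumption of this sketch, not
something the kernel code in this patch enforces.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Same sizing as the kernel's name[] buffer: two longs in hex, '-' and NUL. */
#define MAP_FILES_NAME_MAX (4 * sizeof(long) + 2)

/* Parse one run of hex digits (no "0x" prefix); returns the first
 * unparsed character, or NULL if there were no digits or on overflow. */
static const char *parse_hex(const char *s, unsigned long *out)
{
	unsigned long v = 0;
	const char *p = s;

	for (; isxdigit((unsigned char)*p); p++) {
		int d = isdigit((unsigned char)*p)
			? *p - '0'
			: tolower((unsigned char)*p) - 'a' + 10;

		if (v > (ULONG_MAX >> 4))
			return NULL;	/* another digit would overflow */
		v = (v << 4) | (unsigned long)d;
	}
	if (p == s)
		return NULL;
	*out = v;
	return p;
}

/* Hypothetical helper: split "start-end" into its two addresses.
 * Returns 0 on success and -EINVAL on malformed input. */
static int parse_map_files_name(const char *name,
				unsigned long *start, unsigned long *end)
{
	unsigned long s, e;
	const char *p;

	if (!name)
		return -EINVAL;
	p = parse_hex(name, &s);
	if (!p || *p != '-')
		return -EINVAL;
	p = parse_hex(p + 1, &e);
	if (!p || *p != '\0')
		return -EINVAL;
	if (s >= e)		/* sketch-only sanity check */
		return -EINVAL;
	*start = s;
	*end = e;
	return 0;
}

/* Hypothetical helper: build the entry name for a given range. */
static int format_map_files_name(char *buf, size_t len,
				 unsigned long start, unsigned long end)
{
	int n = snprintf(buf, len, "%lx-%lx", start, end);

	return (n < 0 || (size_t)n >= len) ? -EINVAL : 0;
}
```

A caller walking /proc/<pid>/map_files/ with readdir() could pass each
d_name to parse_map_files_name() and skip anything that returns -EINVAL,
such as "." and "..".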

 fs/proc/base.c | 59 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..1355a4d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1632,8 +1632,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1660,17 +1658,12 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = dentry->d_inode;
 	task = get_proc_task(inode);
 	if (!task)
 		goto out_notask;
 
-	mm = mm_access(task, PTRACE_MODE_READ);
+	mm = mm_access(task, PTRACE_MODE_ATTACH);
 	if (IS_ERR_OR_NULL(mm))
 		goto out;
 
@@ -1753,6 +1746,39 @@ struct map_files_info {
 	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Enforce stronger PTRACE_MODE_ATTACH permissions on the symlinks under
+ * /proc/<pid>/map_files, since these links may refer to deleted or O_TMPFILE
+ * files that users might assume are inaccessible regardless of their
+ * ownership/permissions.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *task;
+	int allowed = 0;
+
+	task = get_proc_task(inode);
+	if (task) {
+		allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH);
+		put_task_struct(task);
+	}
+
+	if (!allowed)
+		return ERR_PTR(-EACCES);
+
+	return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations proc_map_files_link_inode_operations = {
+	.readlink	= proc_pid_readlink,
+	.follow_link	= proc_map_files_follow_link,
+	.setattr	= proc_setattr,
+};
+
 static int
 proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 			   struct task_struct *task, const void *ptr)
@@ -1768,7 +1794,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 	ei = PROC_I(inode);
 	ei->op.proc_get_link = proc_map_files_get_link;
 
-	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
 	inode->i_mode = S_IFLNK;
 
@@ -1792,17 +1818,13 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
 		goto out;
 
 	result = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
 		goto out_put_task;
 
 	result = -ENOENT;
@@ -1849,17 +1871,13 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		goto out;
 
 	ret = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
 		goto out_put_task;
 
 	ret = 0;
@@ -2040,7 +2058,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2537,9 +2554,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
1.8.1


* Re: [RFC][PATCH v3] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-02-12  2:29             ` [RFC][PATCH v3] " Calvin Owens
@ 2015-02-12  7:45               ` Cyrill Gorcunov
  2015-02-14 20:40               ` [RFC][PATCH v4] " Calvin Owens
  2015-02-14 20:44               ` [PATCH] procfs: Return -ESRCH on /proc/N/fd/* when PID N doesn't exist Calvin Owens
  2 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-02-12  7:45 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov

On Wed, Feb 11, 2015 at 06:29:10PM -0800, Calvin Owens wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
> only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
> useful for enumerating the files mapped into a process when the more
> verbose information in /proc/<pid>/maps is not needed. It also allows
> access to file descriptors for files that have been deleted and closed
> but are still mmapped into a process, which can be very useful for
> introspection and debugging.
...
>  
> +/*
> + * Enforce stronger PTRACE_MODE_ATTACH permissions on the symlinks under
> + * /proc/<pid>/map_files, since these links may refer to deleted or O_TMPFILE
> + * files that users might assume are inaccessible regardless of their
> + * ownership/permissions.
> + */
> +static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
> +{
> +	struct inode *inode = dentry->d_inode;
> +	struct task_struct *task;
> +	int allowed = 0;
> +
> +	task = get_proc_task(inode);
> +	if (task) {
> +		allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH);
> +		put_task_struct(task);
> +	}

	else
		return ERR_PTR(-ESRCH);

Other than that, looks good to me, thanks!

Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>


* [RFC][PATCH v4] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-02-12  2:29             ` [RFC][PATCH v3] " Calvin Owens
  2015-02-12  7:45               ` Cyrill Gorcunov
@ 2015-02-14 20:40               ` Calvin Owens
  2015-03-10 22:17                 ` Cyrill Gorcunov
  2015-05-19  3:10                 ` [PATCH v5] " Calvin Owens
  2015-02-14 20:44               ` [PATCH] procfs: Return -ESRCH on /proc/N/fd/* when PID N doesn't exist Calvin Owens
  2 siblings, 2 replies; 80+ messages in thread
From: Calvin Owens @ 2015-02-14 20:40 UTC (permalink / raw)
  To: Cyrill Gorcunov, Kirill A. Shutemov, Andrew Morton
  Cc: Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team, Pavel Emelyanov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
very useful for enumerating the files mapped into a process when the
more verbose information in /proc/<pid>/maps is not needed. It also
allows access to file descriptors for files that have been deleted and
closed but are still mmapped into a process, which can be very useful
for introspection and debugging.

This patch moves the folder out from behind CHECKPOINT_RESTORE, and
removes the CAP_SYS_ADMIN restrictions. With that change alone,
following the links would have required PTRACE_MODE_READ like the
links in /proc/<pid>/fd/*.

However, a discussion on lkml concluded that MODE_READ is not
sufficient, both because write access to the inodes these links point
to allows direct modification of a process's address space, and
because it exposes files that users may have overlooked permissions on
because it was assumed they would be inaccessible (either deleted as
per above, or created via O_TMPFILE).

So, in addition to the above, this patch enforces PTRACE_MODE_ATTACH on
all the map_files operations. Since this is the same check that
determines if access to /proc/<pid>/mem is allowed, it will not allow an
attacker to do anything that was not already possible through that
interface.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
Changes in v4:	Return -ESRCH from follow_link() when get_proc_task()
		returns NULL.

Changes in v3:	Changed permission checks to use PTRACE_MODE_ATTACH
		instead of PTRACE_MODE_READ, and added a stub to
		enforce MODE_ATTACH on follow_link() as well.

Changes in v2:	Removed the follow_link() stub that returned -EPERM if
		the caller didn't have CAP_SYS_ADMIN, since the caller
		in my chroot() scenario gets -EACCES anyway.
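
To make the v4 error ordering concrete, here is a small userspace model
of the checks in proc_map_files_follow_link(). It is not kernel code.
struct fake_task and map_files_follow_check() are hypothetical names, and
the attach_allowed flag stands in for the result of
ptrace_may_access(task, PTRACE_MODE_ATTACH).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct; not the kernel type. */
struct fake_task {
	bool attach_allowed;	/* models the PTRACE_MODE_ATTACH check */
};

/*
 * Models the order of checks in the v4 follow_link() stub:
 * a missing task is reported first, a denied ptrace check second.
 */
static int map_files_follow_check(const struct fake_task *task)
{
	if (!task)
		return -ESRCH;	/* get_proc_task() returned NULL */
	if (!task->attach_allowed)
		return -EACCES;	/* caller may not attach to the task */
	return 0;		/* would go on to proc_pid_follow_link() */
}
```

With this ordering a caller can tell "the task went away" apart from
"you are not allowed to look", which the v3 stub reported identically as
-EACCES.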

 fs/proc/base.c | 61 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..b918692 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1632,8 +1632,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1660,17 +1658,12 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = dentry->d_inode;
 	task = get_proc_task(inode);
 	if (!task)
 		goto out_notask;
 
-	mm = mm_access(task, PTRACE_MODE_READ);
+	mm = mm_access(task, PTRACE_MODE_ATTACH);
 	if (IS_ERR_OR_NULL(mm))
 		goto out;
 
@@ -1753,6 +1746,41 @@ struct map_files_info {
 	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Enforce stronger PTRACE_MODE_ATTACH permissions on the symlinks under
+ * /proc/<pid>/map_files, since these links may refer to deleted or O_TMPFILE
+ * files that users might assume are inaccessible regardless of their
+ * ownership/permissions.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *task;
+	int allowed = 0;
+
+	task = get_proc_task(inode);
+	if (task) {
+		allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH);
+		put_task_struct(task);
+	} else {
+		return ERR_PTR(-ESRCH);
+	}
+
+	if (!allowed)
+		return ERR_PTR(-EACCES);
+
+	return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations proc_map_files_link_inode_operations = {
+	.readlink	= proc_pid_readlink,
+	.follow_link	= proc_map_files_follow_link,
+	.setattr	= proc_setattr,
+};
+
 static int
 proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 			   struct task_struct *task, const void *ptr)
@@ -1768,7 +1796,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 	ei = PROC_I(inode);
 	ei->op.proc_get_link = proc_map_files_get_link;
 
-	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
 	inode->i_mode = S_IFLNK;
 
@@ -1792,17 +1820,13 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
 		goto out;
 
 	result = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
 		goto out_put_task;
 
 	result = -ENOENT;
@@ -1849,17 +1873,13 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		goto out;
 
 	ret = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
 		goto out_put_task;
 
 	ret = 0;
@@ -2040,7 +2060,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2537,9 +2556,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
1.8.1



* [PATCH] procfs: Return -ESRCH on /proc/N/fd/* when PID N doesn't exist
  2015-02-12  2:29             ` [RFC][PATCH v3] " Calvin Owens
  2015-02-12  7:45               ` Cyrill Gorcunov
  2015-02-14 20:40               ` [RFC][PATCH v4] " Calvin Owens
@ 2015-02-14 20:44               ` Calvin Owens
  2 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-02-14 20:44 UTC (permalink / raw)
  To: Cyrill Gorcunov, Kirill A. Shutemov, Andrew Morton
  Cc: Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, linux-kernel, kernel-team, Pavel Emelyanov

Currently, readlink() and follow_link() for the symbolic links in
/proc/<pid>/fd/* will return -EACCES in the case where looking up the
task finds that it does not exist.

This patch inlines the logic from proc_fd_access_allowed() into these
two functions such that they will return -ESRCH if the lookup in /proc
races with the task exiting. Since those were the only two callers of
that helper function, it also removes it.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
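A rough sketch of how a userspace caller might tell the two errors apart
once this change is applied. It is illustrative only and not part of the
patch; read_proc_fd_link() and classify_link_error() are hypothetical
helpers. Note that a PID which is already gone when the path is resolved
would typically fail earlier with ENOENT, so ESRCH is expected only when
the lookup races with the task exiting.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

enum link_status {
	LINK_OK,
	LINK_TASK_GONE,		/* -ESRCH: task exited during the lookup */
	LINK_DENIED,		/* -EACCES: ptrace access check failed */
	LINK_OTHER,		/* anything else, e.g. -ENOENT */
};

/* Hypothetical helper: map a negative errno value to a coarse status. */
static enum link_status classify_link_error(int err)
{
	switch (err) {
	case 0:
		return LINK_OK;
	case -ESRCH:
		return LINK_TASK_GONE;
	case -EACCES:
		return LINK_DENIED;
	default:
		return LINK_OTHER;
	}
}

/*
 * Hypothetical helper: readlink() on /proc/<pid>/fd/<fd>.
 * Returns 0 and a NUL-terminated target in buf, or a negative errno.
 */
static int read_proc_fd_link(long pid, int fd, char *buf, size_t len)
{
	char path[64];
	ssize_t n;

	if (!buf || len == 0)
		return -EINVAL;
	snprintf(path, sizeof(path), "/proc/%ld/fd/%d", pid, fd);
	n = readlink(path, buf, len - 1);
	if (n < 0)
		return -errno;
	buf[n] = '\0';
	return 0;
}
```

A tool like lsof could retry or silently skip a PID on LINK_TASK_GONE
while still reporting LINK_DENIED to the user.
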
 fs/proc/base.c | 47 ++++++++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..308fcbd 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -485,23 +485,6 @@ static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 /*                       Here the fs part begins                        */
 /************************************************************************/
 
-/* permission checks */
-static int proc_fd_access_allowed(struct inode *inode)
-{
-	struct task_struct *task;
-	int allowed = 0;
-	/* Allow access to a task's file descriptors if it is us or we
-	 * may use ptrace attach to the process and find out that
-	 * information.
-	 */
-	task = get_proc_task(inode);
-	if (task) {
-		allowed = ptrace_may_access(task, PTRACE_MODE_READ);
-		put_task_struct(task);
-	}
-	return allowed;
-}
-
 int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
@@ -1375,10 +1358,21 @@ static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct inode *inode = dentry->d_inode;
 	struct path path;
-	int error = -EACCES;
+	int error = -ESRCH;
+	int allowed = 0;
+	struct task_struct *task;
 
 	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
+	task = get_proc_task(inode);
+	if (task) {
+		allowed = ptrace_may_access(task, PTRACE_MODE_READ);
+		put_task_struct(task);
+	} else {
+		goto out;
+	}
+
+	error = -EACCES;
+	if (!allowed)
 		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
@@ -1417,12 +1411,23 @@ static int do_proc_readlink(struct path *path, char __user *buffer, int buflen)
 
 static int proc_pid_readlink(struct dentry * dentry, char __user * buffer, int buflen)
 {
-	int error = -EACCES;
+	int error = -ESRCH;
+	int allowed = 0;
+	struct task_struct *task;
 	struct inode *inode = dentry->d_inode;
 	struct path path;
 
 	/* Are we allowed to snoop on the tasks file descriptors? */
-	if (!proc_fd_access_allowed(inode))
+	task = get_proc_task(inode);
+	if (task) {
+		allowed = ptrace_may_access(task, PTRACE_MODE_READ);
+		put_task_struct(task);
+	} else {
+		goto out;
+	}
+
+	error = -EACCES;
+	if (!allowed)
 		goto out;
 
 	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
-- 
1.8.1



* Re: [RFC][PATCH v4] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-02-14 20:40               ` [RFC][PATCH v4] " Calvin Owens
@ 2015-03-10 22:17                 ` Cyrill Gorcunov
  2015-04-28 22:23                   ` Calvin Owens
  2015-05-19  3:10                 ` [PATCH v5] " Calvin Owens
  1 sibling, 1 reply; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-03-10 22:17 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov

On Sat, Feb 14, 2015 at 12:40:09PM -0800, Calvin Owens wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
> very useful for enumerating the files mapped into a process when the
> more verbose information in /proc/<pid>/maps is not needed. It also
> allows access to file descriptors for files that have been deleted and
> closed but are still mmapped into a process, which can be very useful
> for introspection and debugging.

Guys, I'm really sorry for taking so long to reply to this email.
If I understand correctly, all concerns were addressed, right?


* Re: [RFC][PATCH v4] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-03-10 22:17                 ` Cyrill Gorcunov
@ 2015-04-28 22:23                   ` Calvin Owens
  2015-04-29  7:32                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-04-28 22:23 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov, calvinowens

On Wednesday 03/11 at 01:17 +0300, Cyrill Gorcunov wrote:
> On Sat, Feb 14, 2015 at 12:40:09PM -0800, Calvin Owens wrote:
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
> > very useful for enumerating the files mapped into a process when the
> > more verbose information in /proc/<pid>/maps is not needed. It also
> > allows access to file descriptors for files that have been deleted and
> > closed but are still mmapped into a process, which can be very useful
> > for introspection and debugging.
> 
> Guys, I'm really sorry for taking so long to reply to this email.
> If I understand correctly, all concerns were addressed, right?

Ping!

I thought everybody was happy after the permission check was changed
to be PTRACE_MODE_ATTACH. But I'll resend one more time if you prefer.

Thanks,
Calvin


* Re: [RFC][PATCH v4] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-04-28 22:23                   ` Calvin Owens
@ 2015-04-29  7:32                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-04-29  7:32 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kirill A. Shutemov, Andrew Morton, Alexey Dobriyan,
	Oleg Nesterov, Eric W. Biederman, Al Viro, Kirill A. Shutemov,
	Peter Feiner, Grant Likely, Siddhesh Poyarekar, linux-kernel,
	kernel-team, Pavel Emelyanov

On Tue, Apr 28, 2015 at 03:23:53PM -0700, Calvin Owens wrote:
> > 
> > Guys, I'm really sorry for taking so long to reply to this email.
> > If I understand correctly, all concerns were addressed, right?
> 
> Ping!
> 
> I thought everybody was happy after the permission check was changed
> to be PTRACE_MODE_ATTACH. But I'll resend one more time if you prefer.

Yes, please. I also thought PTRACE_MODE_ATTACH suited everyone, but
let's give it another round of review.


* [PATCH v5] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-02-14 20:40               ` [RFC][PATCH v4] " Calvin Owens
  2015-03-10 22:17                 ` Cyrill Gorcunov
@ 2015-05-19  3:10                 ` Calvin Owens
  2015-05-19  3:29                   ` Joe Perches
                                     ` (2 more replies)
  1 sibling, 3 replies; 80+ messages in thread
From: Calvin Owens @ 2015-05-19  3:10 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells
  Cc: Calvin Owens, linux-kernel, kernel-team, Andy Lutomirski,
	Kees Cook, Kirill A. Shutemov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
very useful for enumerating the files mapped into a process when the
more verbose information in /proc/<pid>/maps is not needed. It also
allows access to file descriptors for files that have been deleted and
closed but are still mmapped into a process, which can be very useful
for introspection and debugging.

This patch moves the folder out from behind CHECKPOINT_RESTORE, and
removes the CAP_SYS_ADMIN restrictions. With that change alone,
following the links would have required PTRACE_MODE_READ like the
links in /proc/<pid>/fd/*.

However, a discussion on lkml concluded that MODE_READ is not
sufficient, both because write access to the inodes these links point
to allows direct modification of a process's address space, and
because it exposes files that users may have overlooked permissions on
because it was assumed they would be inaccessible (either deleted as
per above, or created via O_TMPFILE).

So, in addition to the above, this patch enforces PTRACE_MODE_ATTACH on
all the map_files operations. Since this is the same check that
determines if access to /proc/<pid>/mem is allowed, it will not allow an
attacker to do anything that was not already possible through that
interface.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
Changes in v5:	s/dentry->d_inode/d_inode(dentry)/g

Changes in v4:	Return -ESRCH from follow_link() when get_proc_task()
		returns NULL.

Changes in v3:	Changed permission checks to use PTRACE_MODE_ATTACH
		instead of PTRACE_MODE_READ, and added a stub to
		enforce MODE_ATTACH on follow_link() as well.

Changes in v2:	Removed the follow_link() stub that returned -EPERM if
		the caller didn't have CAP_SYS_ADMIN, since the caller
		in my chroot() scenario gets -EACCES anyway.

 fs/proc/base.c | 61 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 093ca14..22d95a7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1641,8 +1641,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1669,17 +1667,12 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = d_inode(dentry);
 	task = get_proc_task(inode);
 	if (!task)
 		goto out_notask;
 
-	mm = mm_access(task, PTRACE_MODE_READ);
+	mm = mm_access(task, PTRACE_MODE_ATTACH);
 	if (IS_ERR_OR_NULL(mm))
 		goto out;
 
@@ -1762,6 +1755,41 @@ struct map_files_info {
 	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Enforce stronger PTRACE_MODE_ATTACH permissions on the symlinks under
+ * /proc/<pid>/map_files, since these links may refer to deleted or O_TMPFILE
+ * files that users might assume are inaccessible regardless of their
+ * ownership/permissions.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	struct inode *inode = d_inode(dentry);
+	struct task_struct *task;
+	int allowed = 0;
+
+	task = get_proc_task(inode);
+	if (task) {
+		allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH);
+		put_task_struct(task);
+	} else {
+		return ERR_PTR(-ESRCH);
+	}
+
+	if (!allowed)
+		return ERR_PTR(-EACCES);
+
+	return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations proc_map_files_link_inode_operations = {
+	.readlink	= proc_pid_readlink,
+	.follow_link	= proc_map_files_follow_link,
+	.setattr	= proc_setattr,
+};
+
 static int
 proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 			   struct task_struct *task, const void *ptr)
@@ -1777,7 +1805,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 	ei = PROC_I(inode);
 	ei->op.proc_get_link = proc_map_files_get_link;
 
-	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
 	inode->i_mode = S_IFLNK;
 
@@ -1801,17 +1829,13 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
 		goto out;
 
 	result = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
 		goto out_put_task;
 
 	result = -ENOENT;
@@ -1858,17 +1882,13 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		goto out;
 
 	ret = -EACCES;
-	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH))
 		goto out_put_task;
 
 	ret = 0;
@@ -2050,7 +2070,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2549,9 +2568,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
1.8.1



* Re: [PATCH v5] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-05-19  3:10                 ` [PATCH v5] " Calvin Owens
@ 2015-05-19  3:29                   ` Joe Perches
  2015-05-19 18:04                   ` Andy Lutomirski
  2015-06-09  3:39                   ` [PATCH v6] " Calvin Owens
  2 siblings, 0 replies; 80+ messages in thread
From: Joe Perches @ 2015-05-19  3:29 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, David Howells,
	linux-kernel, kernel-team, Andy Lutomirski, Kees Cook,
	Kirill A. Shutemov

On Mon, 2015-05-18 at 20:10 -0700, Calvin Owens wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
> very useful for enumerating the files mapped into a process when the
> more verbose information in /proc/<pid>/maps is not needed. It also
> allows access to file descriptors for files that have been deleted and
> closed but are still mmapped into a process, which can be very useful
> for introspection and debugging.

style trivia:

> diff --git a/fs/proc/base.c b/fs/proc/base.c
[]
> +/*
> + * Enforce stronger PTRACE_MODE_ATTACH permissions on the symlinks under
> + * /proc/<pid>/map_files, since these links may refer to deleted or O_TMPFILE
> + * files that users might assume are inaccessible regardless of their
> + * ownership/permissions.
> + */
> +static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
> +{
> +	struct inode *inode = d_inode(dentry);
> +	struct task_struct *task;
> +	int allowed = 0;
> +
> +	task = get_proc_task(inode);
> +	if (task) {
> +		allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH);
> +		put_task_struct(task);
> +	} else {
> +		return ERR_PTR(-ESRCH);
> +	}
> +
> +	if (!allowed)
> +		return ERR_PTR(-EACCES);
> +
> +	return proc_pid_follow_link(dentry, nd);
> +}

It'd perhaps be clearer to read this with an
immediate return after a failure in get_proc_task.

Maybe something like (move initializations as desired):

static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
{
	int allowed;
	struct inode *inode = d_inode(dentry);
	struct task_struct *task = get_proc_task(inode);

	if (!task)
		return ERR_PTR(-ESRCH);

	allowed = ptrace_may_access(task, PTRACE_MODE_ATTACH);

	put_task_struct(task);

	if (!allowed)
		return ERR_PTR(-EACCES);

	return proc_pid_follow_link(dentry, nd);
}




* Re: [PATCH v5] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-05-19  3:10                 ` [PATCH v5] " Calvin Owens
  2015-05-19  3:29                   ` Joe Perches
@ 2015-05-19 18:04                   ` Andy Lutomirski
  2015-05-21  1:52                     ` Calvin Owens
  2015-06-09  3:39                   ` [PATCH v6] " Calvin Owens
  2 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2015-05-19 18:04 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells, linux-kernel, kernel-team, Kees Cook,
	Kirill A. Shutemov

On Mon, May 18, 2015 at 8:10 PM, Calvin Owens <calvinowens@fb.com> wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
> very useful for enumerating the files mapped into a process when the
> more verbose information in /proc/<pid>/maps is not needed. It also
> allows access to file descriptors for files that have been deleted and
> closed but are still mmapped into a process, which can be very useful
> for introspection and debugging.
>
> This patch moves the folder out from behind CHECKPOINT_RESTORE, and

I'm fine with this.

> removes the CAP_SYS_ADMIN restrictions. With that change alone,
> following the links would have required PTRACE_MODE_READ like the
> links in /proc/<pid>/fd/*.

I'm still not at all convinced that this is safe.  Here are a few ways
that it could have unintended consequences:

1. Mmap a dma-buf and then open /proc/self/map_files/addr.  You get an
fd pointing at a different inode than you mapped.  (kdbus would have
the same problem if it were merged.)

2. Open a file with O_RDONLY, mmap it with PROT_READ, close the file,
then open /proc/self/map_files/addr with O_RDWR.  I don't see anything
preventing that from succeeding.

3. Open a file, mmap it, close the fd, chroot, drop privileges, open
/proc/self/map_files/addr, then call ftruncate.
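
As a rough illustration of scenario 2, a minimal userspace sketch might
look like the following. The helper names map_files_path() and
reopen_mapping_rdwr() are made up for this example, and it assumes the
map_files entry for a one-page mapping is named "<start>-<end>" in hex.
Whether the final open() succeeds depends on the checks the kernel
applies when the link is followed, which is the point under discussion.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: build the map_files link name for [start, start+size). */
static int map_files_path(char *buf, size_t len, const void *start, size_t size)
{
	unsigned long s = (unsigned long)start;
	int n = snprintf(buf, len, "/proc/self/map_files/%lx-%lx", s, s + size);

	assert(n > 0 && (size_t)n < len);	/* caller's buffer must be large enough */
	return n;
}

/*
 * Hypothetical helper for scenario 2: open read-only, map read-only,
 * close the fd, then try to reopen the mapping read-write through
 * map_files. Returns the new fd on success or a negative errno value.
 */
static int reopen_mapping_rdwr(const char *path)
{
	char link[64];
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *addr;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	addr = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
	ret = (addr == MAP_FAILED) ? -errno : 0;
	close(fd);			/* the mapping keeps the file referenced */
	if (ret)
		return ret;

	map_files_path(link, sizeof(link), addr, page);
	fd = open(link, O_RDWR);	/* follows the symlink */
	ret = (fd < 0) ? -errno : fd;

	munmap(addr, page);
	return ret;
}
```

A caller would pass the path of a file it can read, for example
reopen_mapping_rdwr("/etc/hostname"), and inspect the result: a
non-negative return means a writable descriptor was obtained from a
read-only mapping, while a negative value is the errno from whichever
step failed.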

So NAK as-is, I think.

Fixing #1 would involve changing the way mmap works, I think.  Fixing
#2 would require similar infrastructure to what we'd need to fix the
existing /proc/pid/fd mode holes.  I have no clue how to even approach
fixing #3.

What's the use case of this patch?

--Andy


* Re: [PATCH v5] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-05-19 18:04                   ` Andy Lutomirski
@ 2015-05-21  1:52                     ` Calvin Owens
  2015-05-21  2:10                       ` Andy Lutomirski
  0 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-05-21  1:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells, linux-kernel, kernel-team, Kees Cook,
	Kirill A. Shutemov, calvinowens

On Tuesday 05/19 at 11:04 -0700, Andy Lutomirski wrote:
> On Mon, May 18, 2015 at 8:10 PM, Calvin Owens <calvinowens@fb.com> wrote:
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
> > very useful for enumerating the files mapped into a process when the
> > more verbose information in /proc/<pid>/maps is not needed. It also
> > allows access to file descriptors for files that have been deleted and
> > closed but are still mmapped into a process, which can be very useful
> > for introspection and debugging.
> >
> > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
> 
> I'm fine with this.
> 
> > removes the CAP_SYS_ADMIN restrictions. With that change alone,
> > following the links would have required PTRACE_MODE_READ like the
> > links in /proc/<pid>/fd/*.
> 
> I'm still not at all convinced that this is safe.  Here are a few ways
> that it could have unintended consequences:
>
> 1. Mmap a dma-buf and then open /proc/self/map_files/addr.  You get an
> fd pointing at a different inode than you mapped.  (kdbus would have
> the same problem if it were merged.)
>
> 2. Open a file with O_RDONLY, mmap it with PROT_READ, close the file,
> then open /proc/self/map_files/addr with O_RDWR.  I don't see anything
> preventing that from succeeding.

Hmm, that's a good point: it lets you bypass the permission checks on
all the path components you would normally walk through to get to the
file. But it still only works if you actually have permission to open
the file in question for writing.

Also, this is already how the /proc/N/fd/* symlinks work, isn't it?

> 3. Open a file, mmap it, close the fd, chroot, drop privileges, open
> /proc/self/map_files/addr, then call ftruncate.

This doesn't work unless the privileges you dropped to actually allow
you to open the mmapped file for writing. It's really the same
fundamental problem as (2), where you're allowing direct access to a
file without trying to walk the path down to it, right?

> So NAK as-is, I think.

Limiting ->follow_link() to CAP_SYS_ADMIN wouldn't affect anything I
imagine using this interface for (see below), so I have no problem with
putting that back in. I think that would alleviate all your concerns
above, right?

(That said, I don't think it makes sense to limit readdir() or
readlink() on map_files/* to CAP_SYS_ADMIN, since that alone is a subset
of what you can get from /proc/N/maps.)

> Fixing #1 would involve changing the way mmap works, I think.  Fixing
> #2 would require similar infrastructure to what we'd need to fix the
> existing /proc/pid/fd mode holes.  I have no clue how to even approach
> fixing #3.
>
> What's the use case of this patch?

The biggest use case: it enables you to stat() files that have been
deleted but are still mapped by some process.

This enables a much quicker and more accurate answer to the question
"How much disk space is being consumed by files that are deleted but
still mapped?" than is currently possible.
 
It also allows you to know how much space a specific mapped-but-deleted
file is using on a specific filesystem, which is currently impossible
from userspace AFAIK.
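
A minimal userspace sketch of that use case follows. It assumes that
st_nlink == 0 is an adequate test for "deleted", that the caller is
permitted to follow the map_files links, and that st_blocks is counted
in 512-byte units. The names seen_set, seen_before(),
deleted_mapped_bytes() and the MAX_SEEN cap are invented for the
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MAX_SEEN 1024	/* arbitrary cap for this sketch */

/* Small set of (device, inode) pairs so each file is counted once. */
struct seen_set {
	size_t n;
	struct { dev_t dev; ino_t ino; } ent[MAX_SEEN];
};

/* Returns 1 if (dev, ino) was already recorded; otherwise records it, returns 0. */
static int seen_before(struct seen_set *set, dev_t dev, ino_t ino)
{
	size_t i;

	for (i = 0; i < set->n; i++)
		if (set->ent[i].dev == dev && set->ent[i].ino == ino)
			return 1;
	if (set->n < MAX_SEEN) {
		set->ent[set->n].dev = dev;
		set->ent[set->n].ino = ino;
		set->n++;
	}
	return 0;
}

/*
 * Sum the disk usage of deleted files still mapped by @pid.
 * Returns a byte count, or -1 if map_files cannot be opened.
 */
static long long deleted_mapped_bytes(pid_t pid)
{
	struct seen_set seen = { 0 };
	char path[64];
	struct dirent *de;
	struct stat st;
	long long total = 0;
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%d/map_files", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		/* flags == 0: follow the symlink and stat the mapped file */
		if (fstatat(dirfd(dir), de->d_name, &st, 0) != 0)
			continue;	/* e.g. not permitted to follow the link */
		if (st.st_nlink != 0 || seen_before(&seen, st.st_dev, st.st_ino))
			continue;	/* still linked, or already counted */
		total += (long long)st.st_blocks * 512;
	}
	closedir(dir);
	return total;
}
```

Filtering on st_dev before adding to the total would give the
per-filesystem figure mentioned above.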

Thanks,
Calvin

> --Andy


* Re: [PATCH v5] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-05-21  1:52                     ` Calvin Owens
@ 2015-05-21  2:10                       ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2015-05-21  2:10 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells, linux-kernel, kernel-team, Kees Cook,
	Kirill A. Shutemov

On Wed, May 20, 2015 at 6:52 PM, Calvin Owens <calvinowens@fb.com> wrote:
> On Tuesday 05/19 at 11:04 -0700, Andy Lutomirski wrote:
>> On Mon, May 18, 2015 at 8:10 PM, Calvin Owens <calvinowens@fb.com> wrote:
>> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface is
>> > very useful for enumerating the files mapped into a process when the
>> > more verbose information in /proc/<pid>/maps is not needed. It also
>> > allows access to file descriptors for files that have been deleted and
>> > closed but are still mmapped into a process, which can be very useful
>> > for introspection and debugging.
>> >
>> > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>>
>> I'm fine with this.
>>
>> > removes the CAP_SYS_ADMIN restrictions. With that change alone,
>> > following the links would have required PTRACE_MODE_READ like the
>> > links in /proc/<pid>/fd/*.
>>
>> I'm still not at all convinced that this is safe.  Here are a few ways
>> that it could have unintended consequences:
>>
>> 1. Mmap a dma-buf and then open /proc/self/map_files/addr.  You get an
>> fd pointing at a different inode than you mapped.  (kdbus would have
>> the same problem if it were merged.)
>>
>> 2. Open a file with O_RDONLY, mmap it with PROT_READ, close the file,
>> then open /proc/self/map_files/addr with O_RDWR.  I don't see anything
>> preventing that from succeeding.
>
> Hmm, that's a good point: it lets you bypass the permission checks on
> all the path components you would normally walk through to get to the
> file. But it still only works if you actually have permission to open
> the file in question for writing.

But you might not still have that permission.

>
> Also, this is already how the /proc/N/fd/* symlinks work, isn't it?

Yes, but only for files that are open.  Also, I hope to fix that some day.

>
>> 3. Open a file, mmap it, close the fd, chroot, drop privileges, open
>> /proc/self/map_files/addr, then call ftruncate.
>
> This doesn't work unless the privileges you dropped to actually allow
> you to open the mmapped file for writing. It's really the same
> fundamental problem as (2), where you're allowing direct access to a
> file without trying to walk the path down to it, right?

Yes, although I can imagine this actually happening.  Also, there's issue #1.

>
>> So NAK as-is, I think.
>
> Limiting ->follow_link() to CAP_SYS_ADMIN wouldn't affect anything I
> imagine using this interface for (see below), so I have no problem with
> putting that back in. I think that would alleviate all your concerns
> above, right?

I think so.  You could still maybe do awful things due to #1, but at
least you'd have to be privileged.

>
> (That said, I don't think it makes sense to limit readdir() or
> readlink() on map_files/* to CAP_SYS_ADMIN, since that alone is a subset
> of what you can get from /proc/N/maps.)

Agreed.

>
>> Fixing #1 would involve changing the way mmap works, I think.  Fixing
>> #2 would require similar infrastructure to what we'd need to fix the
>> existing /proc/pid/fd mode holes.  I have no clue how to even approach
>> fixing #3.
>>
>> What's the use case of this patch?
>
> The biggest use case: it enables you to stat() files that have been
> deleted but are still mapped by some process.
>
> This enables a much quicker and more accurate answer to the question
> "How much disk space is being consumed by files that are deleted but
> still mapped?" than is currently possible.
>
> It also allows you to know how much space a specific mapped-but-deleted
> file is using on a specific filesystem, which is currently impossible
> from userspace AFAIK.

Seems reasonable.

It might be nice to have a general interface for enumerating
deleted-but-still-in-use files on a filesystem some day, too.

--Andy


* [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-05-19  3:10                 ` [PATCH v5] " Calvin Owens
  2015-05-19  3:29                   ` Joe Perches
  2015-05-19 18:04                   ` Andy Lutomirski
@ 2015-06-09  3:39                   ` Calvin Owens
  2015-06-09 17:27                     ` Kees Cook
                                       ` (2 more replies)
  2 siblings, 3 replies; 80+ messages in thread
From: Calvin Owens @ 2015-06-09  3:39 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells
  Cc: Calvin Owens, linux-kernel, kernel-team, Andy Lutomirski,
	Cyrill Gorcunov, Kees Cook, Kirill A. Shutemov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
is only exposed if CONFIG_CHECKPOINT_RESTORE is set.

This interface is very useful because it allows userspace to stat()
deleted files that are still mapped by some process, which enables a
much quicker and more accurate answer to the question "How much disk
space is being consumed by files that are deleted but still mapped?"
than is currently possible.

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
and adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

	Remove the CAP_SYS_ADMIN restriction, leaving only the current
	restriction requiring PTRACE_MODE_READ.

	In earlier versions of this patch, I changed the ptrace checks
	in the functions above to enforce MODE_ATTACH instead of
	MODE_READ. That was an oversight: all the information exposed
	by the above three functions is already available with
	MODE_READ from /proc/PID/maps. I was only being asked to
	strengthen the protection around functionality provided by
	follow_link(), not the above.

	So, I've left the checks for MODE_READ as-is, since AFAICS all
	objections raised so far are addressed by the new CAP_SYS_ADMIN
	check in follow_link(), explained below.

* proc_map_files_follow_link()

	This stub has been added, and requires that the user have
	CAP_SYS_ADMIN in order to follow the links in map_files/,
	since there was concern on LKML both about the potential for
	bypassing permissions on ancestor directories in the path to
	files pointed to, and about what happens with more exotic
	memory mappings created by some drivers (e.g. dma-buf).
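
For illustration only, here is a sketch of what an unprivileged
lsof-style consumer could do under the rules described above: it
enumerates map_files/ and reads each link's text with readlinkat(),
which does not follow the link and so should not need CAP_SYS_ADMIN.
The helper names parse_map_name() and list_mapped_files() are
hypothetical, and the "<start>-<end>" hex naming is taken from the
"%lx-%lx" comment in struct map_files_info.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse an entry name of the form "<start>-<end>" (hex). Returns 0 or -1. */
static int parse_map_name(const char *name, unsigned long *start,
			  unsigned long *end)
{
	char tail;

	/* The trailing %c must not match, so junk after <end> is rejected. */
	if (sscanf(name, "%lx-%lx%c", start, end, &tail) != 2)
		return -1;
	return *start < *end ? 0 : -1;
}

/*
 * Print "start-end -> target" for every mapping of @pid.
 * Returns the number of entries printed, or -1 if map_files/
 * cannot be opened.
 */
static int list_mapped_files(pid_t pid, FILE *out)
{
	char dirpath[64], target[PATH_MAX];
	unsigned long start, end;
	struct dirent *de;
	int count = 0;
	ssize_t n;
	DIR *dir;

	snprintf(dirpath, sizeof(dirpath), "/proc/%d/map_files", (int)pid);
	dir = opendir(dirpath);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (parse_map_name(de->d_name, &start, &end) != 0)
			continue;	/* skips "." and ".." */
		/* Reads the link text only; the link is not followed. */
		n = readlinkat(dirfd(dir), de->d_name, target,
			       sizeof(target) - 1);
		if (n < 0)
			continue;
		target[n] = '\0';
		fprintf(out, "%lx-%lx -> %s\n", start, end, target);
		count++;
	}
	closedir(dir);
	return count;
}
```

Calling list_mapped_files(getpid(), stdout) lists the caller's own
mappings. Opening or stat()ing an entry, as opposed to reading its
link text, goes through follow_link() and is therefore subject to the
CAP_SYS_ADMIN check.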

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Joe Perches <joe@perches.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
Changes in v6:	Require CAP_SYS_ADMIN for follow_link(). Leave other
		PTRACE_MODE_READ checks as-is, since CAP_SYS_ADMIN
		alone addresses all concerns raised AFAICS.

Changes in v5:	s/dentry->d_inode/d_inode(dentry)/g

Changes in v4:	Return -ESRCH from follow_link() when get_proc_task()
		returns NULL.

Changes in v3:	Changed permission checks to use PTRACE_MODE_ATTACH
		instead of PTRACE_MODE_READ, and added a stub to
		enforce MODE_ATTACH on follow_link() as well.

Changes in v2:	Removed the follow_link() stub that returned -EPERM if
		the caller didn't have CAP_SYS_ADMIN, since the caller
		in my chroot() scenario gets -EACCES anyway.

 fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 093ca14..0270191 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1641,8 +1641,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1669,11 +1667,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = d_inode(dentry);
 	task = get_proc_task(inode);
 	if (!task)
@@ -1762,6 +1755,28 @@ struct map_files_info {
 	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
+ * symlinks may be used to bypass permissions on ancestor directories in the
+ * path to the file in question.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations proc_map_files_link_inode_operations = {
+	.readlink	= proc_pid_readlink,
+	.follow_link	= proc_map_files_follow_link,
+	.setattr	= proc_setattr,
+};
+
 static int
 proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 			   struct task_struct *task, const void *ptr)
@@ -1777,7 +1792,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 	ei = PROC_I(inode);
 	ei->op.proc_get_link = proc_map_files_get_link;
 
-	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
 	inode->i_mode = S_IFLNK;
 
@@ -1801,10 +1816,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
@@ -1858,10 +1869,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
@@ -2050,7 +2057,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2549,9 +2555,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
1.8.1



* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-09  3:39                   ` [PATCH v6] " Calvin Owens
@ 2015-06-09 17:27                     ` Kees Cook
  2015-06-09 17:47                       ` Andy Lutomirski
  2015-06-09 21:13                     ` Andrew Morton
  2015-06-19  2:32                     ` [PATCH v7] " Calvin Owens
  2 siblings, 1 reply; 80+ messages in thread
From: Kees Cook @ 2015-06-09 17:27 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells, LKML, kernel-team, Andy Lutomirski,
	Cyrill Gorcunov, Kirill A. Shutemov

On Mon, Jun 8, 2015 at 8:39 PM, Calvin Owens <calvinowens@fb.com> wrote:
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
>
> This interface is very useful because it allows userspace to stat()
> deleted files that are still mapped by some process, which enables a
> much quicker and more accurate answer to the question "How much disk
> space is being consumed by files that are deleted but still mapped?"
> than is currently possible.
>
> This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
> and adjusts the permissions enforced on it as follows:
>
> * proc_map_files_lookup()
> * proc_map_files_readdir()
> * map_files_d_revalidate()
>
>         Remove the CAP_SYS_ADMIN restriction, leaving only the current
>         restriction requiring PTRACE_MODE_READ.
>
>         In earlier versions of this patch, I changed the ptrace checks
>         in the functions above to enforce MODE_ATTACH instead of
>         MODE_READ. That was an oversight: all the information exposed
>         by the above three functions is already available with
>         MODE_READ from /proc/PID/maps. I was only being asked to
>         strengthen the protection around functionality provided by
>         follow_link(), not the above.
>
>         So, I've left the checks for MODE_READ as-is, since AFAICS all
>         objections raised so far are addressed by the new CAP_SYS_ADMIN
>         check in follow_link(), explained below.
>
> * proc_map_files_follow_link()
>
>         This stub has been added, and requires that the user have
>         CAP_SYS_ADMIN in order to follow the links in map_files/,
>         since there was concern on LKML both about the potential for
>         bypassing permissions on ancestor directories in the path to
>         files pointed to, and about what happens with more exotic
>         memory mappings created by some drivers (e.g. dma-buf).
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Joe Perches <joe@perches.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Calvin Owens <calvinowens@fb.com>
> ---
> Changes in v6:  Require CAP_SYS_ADMIN for follow_link(). Leave other
>                 PTRACE_MODE_READ checks as-is, since CAP_SYS_ADMIN
>                 alone addresses all concerns raised AFAICS.
>
> Changes in v5:  s/dentry->d_inode/d_inode(dentry)/g
>
> Changes in v4:  Return -ESRCH from follow_link() when get_proc_task()
>                 returns NULL.
>
> Changes in v3:  Changed permission checks to use PTRACE_MODE_ATTACH
>                 instead of PTRACE_MODE_READ, and added a stub to
>                 enforce MODE_ATTACH on follow_link() as well.
>
> Changes in v2:  Removed the follow_link() stub that returned -EPERM if
>                 the caller didn't have CAP_SYS_ADMIN, since the caller
>                 in my chroot() scenario gets -EACCES anyway.
>
>  fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
>  1 file changed, 23 insertions(+), 19 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 093ca14..0270191 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1641,8 +1641,6 @@ end_instantiate:
>         return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
>  }
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> -
>  /*
>   * dname_to_vma_addr - maps a dentry name into two unsigned longs
>   * which represent vma start and end addresses.
> @@ -1669,11 +1667,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
>         if (flags & LOOKUP_RCU)
>                 return -ECHILD;
>
> -       if (!capable(CAP_SYS_ADMIN)) {
> -               status = -EPERM;
> -               goto out_notask;
> -       }
> -
>         inode = d_inode(dentry);
>         task = get_proc_task(inode);
>         if (!task)
> @@ -1762,6 +1755,28 @@ struct map_files_info {
>         unsigned char   name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
>  };
>
> +/*
> + * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> + * symlinks may be used to bypass permissions on ancestor directories in the
> + * path to the file in question.
> + */

Cool, I think this looks good. Thanks!

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> +static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
> +{
> +       if (!capable(CAP_SYS_ADMIN))
> +               return ERR_PTR(-EPERM);
> +
> +       return proc_pid_follow_link(dentry, nd);
> +}
> +
> +/*
> + * Identical to proc_pid_link_inode_operations except for follow_link()
> + */
> +static const struct inode_operations proc_map_files_link_inode_operations = {
> +       .readlink       = proc_pid_readlink,
> +       .follow_link    = proc_map_files_follow_link,
> +       .setattr        = proc_setattr,
> +};
> +
>  static int
>  proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
>                            struct task_struct *task, const void *ptr)
> @@ -1777,7 +1792,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
>         ei = PROC_I(inode);
>         ei->op.proc_get_link = proc_map_files_get_link;
>
> -       inode->i_op = &proc_pid_link_inode_operations;
> +       inode->i_op = &proc_map_files_link_inode_operations;
>         inode->i_size = 64;
>         inode->i_mode = S_IFLNK;
>
> @@ -1801,10 +1816,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
>         int result;
>         struct mm_struct *mm;
>
> -       result = -EPERM;
> -       if (!capable(CAP_SYS_ADMIN))
> -               goto out;
> -
>         result = -ENOENT;
>         task = get_proc_task(dir);
>         if (!task)
> @@ -1858,10 +1869,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
>         struct map_files_info *p;
>         int ret;
>
> -       ret = -EPERM;
> -       if (!capable(CAP_SYS_ADMIN))
> -               goto out;
> -
>         ret = -ENOENT;
>         task = get_proc_task(file_inode(file));
>         if (!task)
> @@ -2050,7 +2057,6 @@ static const struct file_operations proc_timers_operations = {
>         .llseek         = seq_lseek,
>         .release        = seq_release_private,
>  };
> -#endif /* CONFIG_CHECKPOINT_RESTORE */
>
>  static int proc_pident_instantiate(struct inode *dir,
>         struct dentry *dentry, struct task_struct *task, const void *ptr)
> @@ -2549,9 +2555,7 @@ static const struct inode_operations proc_task_inode_operations;
>  static const struct pid_entry tgid_base_stuff[] = {
>         DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
>         DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
> -#ifdef CONFIG_CHECKPOINT_RESTORE
>         DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
> -#endif
>         DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
>         DIR("ns",         S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
>  #ifdef CONFIG_NET
> --
> 1.8.1
>



-- 
Kees Cook
Chrome OS Security


* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-09 17:27                     ` Kees Cook
@ 2015-06-09 17:47                       ` Andy Lutomirski
  2015-06-09 18:15                         ` Cyrill Gorcunov
  0 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2015-06-09 17:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Calvin Owens, Andrew Morton, Alexey Dobriyan, Eric W. Biederman,
	Al Viro, Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells, LKML, kernel-team, Cyrill Gorcunov,
	Kirill A. Shutemov

On Tue, Jun 9, 2015 at 10:27 AM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jun 8, 2015 at 8:39 PM, Calvin Owens <calvinowens@fb.com> wrote:
>> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
>>
>> This interface is very useful because it allows userspace to stat()
>> deleted files that are still mapped by some process, which enables a
>> much quicker and more accurate answer to the question "How much disk
>> space is being consumed by files that are deleted but still mapped?"
>> than is currently possible.
>>
>> This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
>> and adjusts the permissions enforced on it as follows:
>>
>> * proc_map_files_lookup()
>> * proc_map_files_readdir()
>> * map_files_d_revalidate()
>>
>>         Remove the CAP_SYS_ADMIN restriction, leaving only the current
>>         restriction requiring PTRACE_MODE_READ.
>>
>>         In earlier versions of this patch, I changed the ptrace checks
>>         in the functions above to enforce MODE_ATTACH instead of
>>         MODE_READ. That was an oversight: all the information exposed
>>         by the above three functions is already available with
>>         MODE_READ from /proc/PID/maps. I was only being asked to
>>         strengthen the protection around functionality provided by
>>         follow_link(), not the above.
>>
>>         So, I've left the checks for MODE_READ as-is, since AFAICS all
>>         objections raised so far are addressed by the new CAP_SYS_ADMIN
>>         check in follow_link(), explained below.
>>
>> * proc_map_files_follow_link()
>>
>>         This stub has been added, and requires that the user have
>>         CAP_SYS_ADMIN in order to follow the links in map_files/,
>>         since there was concern on LKML both about the potential for
>>         bypassing permissions on ancestor directories in the path to
>>         files pointed to, and about what happens with more exotic
>>         memory mappings created by some drivers (e.g. dma-buf).
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
>> Cc: Joe Perches <joe@perches.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Calvin Owens <calvinowens@fb.com>
>> ---
>> Changes in v6:  Require CAP_SYS_ADMIN for follow_link(). Leave other
>>                 PTRACE_MODE_READ checks as-is, since CAP_SYS_ADMIN
>>                 alone addresses all concerns raised AFAICS.
>>
>> Changes in v5:  s/dentry->d_inode/d_inode(dentry)/g
>>
>> Changes in v4:  Return -ESRCH from follow_link() when get_proc_task()
>>                 returns NULL.
>>
>> Changes in v3:  Changed permission checks to use PTRACE_MODE_ATTACH
>>                 instead of PTRACE_MODE_READ, and added a stub to
>>                 enforce MODE_ATTACH on follow_link() as well.
>>
>> Changes in v2:  Removed the follow_link() stub that returned -EPERM if
>>                 the caller didn't have CAP_SYS_ADMIN, since the caller
>>                 in my chroot() scenario gets -EACCES anyway.
>>
>>  fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
>>  1 file changed, 23 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 093ca14..0270191 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -1641,8 +1641,6 @@ end_instantiate:
>>         return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
>>  }
>>
>> -#ifdef CONFIG_CHECKPOINT_RESTORE
>> -
>>  /*
>>   * dname_to_vma_addr - maps a dentry name into two unsigned longs
>>   * which represent vma start and end addresses.
>> @@ -1669,11 +1667,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
>>         if (flags & LOOKUP_RCU)
>>                 return -ECHILD;
>>
>> -       if (!capable(CAP_SYS_ADMIN)) {
>> -               status = -EPERM;
>> -               goto out_notask;
>> -       }
>> -
>>         inode = d_inode(dentry);
>>         task = get_proc_task(inode);
>>         if (!task)
>> @@ -1762,6 +1755,28 @@ struct map_files_info {
>>         unsigned char   name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
>>  };
>>
>> +/*
>> + * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
>> + * symlinks may be used to bypass permissions on ancestor directories in the
>> + * path to the file in question.
>> + */
>
> Cool, I think this looks good. Thanks!
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>

Looks good to me, too.

--Andy

> -Kees
>
>> +static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
>> +{
>> +       if (!capable(CAP_SYS_ADMIN))
>> +               return ERR_PTR(-EPERM);
>> +
>> +       return proc_pid_follow_link(dentry, nd);
>> +}
>> +
>> +/*
>> + * Identical to proc_pid_link_inode_operations except for follow_link()
>> + */
>> +static const struct inode_operations proc_map_files_link_inode_operations = {
>> +       .readlink       = proc_pid_readlink,
>> +       .follow_link    = proc_map_files_follow_link,
>> +       .setattr        = proc_setattr,
>> +};
>> +
>>  static int
>>  proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
>>                            struct task_struct *task, const void *ptr)
>> @@ -1777,7 +1792,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
>>         ei = PROC_I(inode);
>>         ei->op.proc_get_link = proc_map_files_get_link;
>>
>> -       inode->i_op = &proc_pid_link_inode_operations;
>> +       inode->i_op = &proc_map_files_link_inode_operations;
>>         inode->i_size = 64;
>>         inode->i_mode = S_IFLNK;
>>
>> @@ -1801,10 +1816,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
>>         int result;
>>         struct mm_struct *mm;
>>
>> -       result = -EPERM;
>> -       if (!capable(CAP_SYS_ADMIN))
>> -               goto out;
>> -
>>         result = -ENOENT;
>>         task = get_proc_task(dir);
>>         if (!task)
>> @@ -1858,10 +1869,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
>>         struct map_files_info *p;
>>         int ret;
>>
>> -       ret = -EPERM;
>> -       if (!capable(CAP_SYS_ADMIN))
>> -               goto out;
>> -
>>         ret = -ENOENT;
>>         task = get_proc_task(file_inode(file));
>>         if (!task)
>> @@ -2050,7 +2057,6 @@ static const struct file_operations proc_timers_operations = {
>>         .llseek         = seq_lseek,
>>         .release        = seq_release_private,
>>  };
>> -#endif /* CONFIG_CHECKPOINT_RESTORE */
>>
>>  static int proc_pident_instantiate(struct inode *dir,
>>         struct dentry *dentry, struct task_struct *task, const void *ptr)
>> @@ -2549,9 +2555,7 @@ static const struct inode_operations proc_task_inode_operations;
>>  static const struct pid_entry tgid_base_stuff[] = {
>>         DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
>>         DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
>> -#ifdef CONFIG_CHECKPOINT_RESTORE
>>         DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
>> -#endif
>>         DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
>>         DIR("ns",         S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
>>  #ifdef CONFIG_NET
>> --
>> 1.8.1
>>
>
>
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-09 17:47                       ` Andy Lutomirski
@ 2015-06-09 18:15                         ` Cyrill Gorcunov
  0 siblings, 0 replies; 80+ messages in thread
From: Cyrill Gorcunov @ 2015-06-09 18:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Calvin Owens, Andrew Morton, Alexey Dobriyan,
	Eric W. Biederman, Al Viro, Miklos Szeredi, Zefan Li,
	Oleg Nesterov, Joe Perches, David Howells, LKML, kernel-team,
	Kirill A. Shutemov

On Tue, Jun 09, 2015 at 10:47:31AM -0700, Andy Lutomirski wrote:
...
> >
> > Cool, I think this looks good. Thanks!
> >
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> >
> 
> Looks good to me, too.

Wow! Great job, Calvin, thanks!


* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-09  3:39                   ` [PATCH v6] " Calvin Owens
  2015-06-09 17:27                     ` Kees Cook
@ 2015-06-09 21:13                     ` Andrew Morton
  2015-06-10  1:39                       ` Calvin Owens
  2015-06-19  2:32                     ` [PATCH v7] " Calvin Owens
  2 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-06-09 21:13 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Alexey Dobriyan, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	linux-kernel, kernel-team, Andy Lutomirski, Cyrill Gorcunov,
	Kees Cook, Kirill A. Shutemov

On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@fb.com> wrote:

> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> 
> This interface is very useful because it allows userspace to stat()
> deleted files that are still mapped by some process, which enables a
> much quicker and more accurate answer to the question "How much disk
> space is being consumed by files that are deleted but still mapped?"
> than is currently possible.

Why is that information useful?

I could perhaps think of some use for "How much disk space is being
consumed by files that are deleted but still open", but to count the
mmapped-then-unlinked files while excluding the opened-then-unlinked
files seems damned peculiar.

IOW, this changelog failed to explain the value of the patch.  Bad
changelog!  Please sell it to us.  Preferably with real-world use
cases.





* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-09 21:13                     ` Andrew Morton
@ 2015-06-10  1:39                       ` Calvin Owens
  2015-06-10 20:58                         ` Andrew Morton
  0 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-06-10  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexey Dobriyan, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	linux-kernel, kernel-team, Andy Lutomirski, Cyrill Gorcunov,
	Kees Cook, Kirill A. Shutemov

On Tuesday 06/09 at 14:13 -0700, Andrew Morton wrote:
> On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@fb.com> wrote:
> 
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> > 
> > This interface is very useful because it allows userspace to stat()
> > deleted files that are still mapped by some process, which enables a
> > much quicker and more accurate answer to the question "How much disk
> > space is being consumed by files that are deleted but still mapped?"
> > than is currently possible.
> 
> Why is that information useful?
> 
> I could perhaps think of some use for "How much disk space is being
> consumed by files that are deleted but still open", but to count the
> mmapped-then-unlinked files while excluding the opened-then-unlinked
> files seems damned peculiar.

Let's phrase the question a bit more generically:

"How much disk space is being consumed by files that have been
unlinked, but are still referenced by some process?"

There are two pieces to this problem:
	1) Unlinked files that are still open (whether mapped or not)
	2) Unlinked files that are not open, but are still mapped

You can track down everything in (1) using /proc/<pid>/fd/*, and you
can use stat() to figure out how much space they're using.

But directly measuring how much space (2) consumes is actually not
currently possible from userspace: there's no way to stat() the files.
You can get the inode number from /proc/<pid>/maps, but that still
doesn't get you anywhere because it's been unlinked from the
filesystem.

So I'm not looking to measure (2) and exclude (1): I'm looking to have
a way to directly measure (2) at all.

The reason I say "directly", and I say "quicker and more accurate" in
the original message, is that there is a very ugly way to answer this
question right now: you sum up the number of blocks used by every file
on the disk and subtract it from what statfs() tells you. This
obviously stinks, and becomes untenable once your filesystem is large
enough.
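
For illustration only, a minimal sketch of the subtraction workaround
described above. It uses statvfs(), the POSIX counterpart of statfs(),
and takes the space used by path-visible files as an input (for example,
the total from a du-style walk of the same filesystem). The function
names are made up for this sketch. The result is only an estimate: the
used-block count may also include filesystem metadata, so it can
overstate the space held by unlinked files.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Pure arithmetic: bytes in use on the filesystem that no path-visible
 * file accounts for. Clamped at zero, since the visible total can race
 * ahead of the statvfs() snapshot.
 */
uint64_t unaccounted_bytes(uint64_t total_blocks, uint64_t free_blocks,
			   uint64_t block_size, uint64_t visible)
{
	uint64_t used = (total_blocks - free_blocks) * block_size;

	return used > visible ? used - visible : 0;
}

/*
 * Estimate the space held by unlinked files on the filesystem containing
 * path. visible is the caller-supplied total of bytes allocated to files
 * that still have a name. Returns 0 on success, -1 if statvfs() fails.
 */
int estimate_unlinked_bytes(const char *path, uint64_t visible, uint64_t *out)
{
	struct statvfs vfs;

	if (statvfs(path, &vfs) == -1)
		return -1;

	/* f_blocks and f_bfree are counted in units of f_frsize */
	*out = unaccounted_bytes(vfs.f_blocks, vfs.f_bfree, vfs.f_frsize,
				 visible);
	return 0;
}
```

Producing the visible total means touching every file on the
filesystem, which is the part that stops scaling.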
 
> IOW, this changelog failed to explain the value of the patch.  Bad
> changelog!  Please sell it to us.  Preferably with real-world use
> cases.

The real-world use case is catching long-lived processes that leak
references to temporary files and waste space on the disk. When such
processes leak file-backed mappings, this wasted space is especially
difficult to detect until it gets out of hand. The map_files/
interface eliminates this difficulty.

I've included a little test program at the end of this message to
illustrate what I'm getting at here. It creates a file at /tmp/DELETEDFILE:

	calvinowens@Haydn:~$ gcc test.c 
	calvinowens@Haydn:~$ ./a.out &
	[1] 5832
	Holding mapping at 0x7fe74d1ea000
	calvinowens@Haydn:~$ lsof -p `pgrep a.out`
	COMMAND  PID        USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
	a.out   5832 calvinowens  cwd    DIR  254,1     4096 3413033 /home/calvinowens
	a.out   5832 calvinowens  rtd    DIR  254,1     4096       2 /
	a.out   5832 calvinowens  txt    REG  254,1     7512 3408268 /home/calvinowens/a.out
	a.out   5832 calvinowens  mem    REG  254,1  1729984 4456767 /lib/x86_64-linux-gnu/libc-2.19.so
	a.out   5832 calvinowens  mem    REG  254,1   140928 4456619 /lib/x86_64-linux-gnu/ld-2.19.so
	a.out   5832 calvinowens  mem    REG   0,32    32768  184946 /tmp/DELETEDFILE
	a.out   5832 calvinowens    0u   CHR  136,2      0t0       5 /dev/pts/2
	a.out   5832 calvinowens    1u   CHR  136,2      0t0       5 /dev/pts/2
	a.out   5832 calvinowens    2u   CHR  136,2      0t0       5 /dev/pts/2
	calvinowens@Haydn:~$ killall a.out
	[1]+  Terminated              ./a.out
	calvinowens@Haydn:~$ gcc -DDO_UNLINK test.c 
	calvinowens@Haydn:~$ ./a.out &
	[1] 5842
	Holding mapping at 0x7fec8ae63000
	calvinowens@Haydn:~$ lsof -p `pgrep a.out`
	COMMAND  PID        USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
	a.out   5842 calvinowens  cwd    DIR  254,1     4096 3413033 /home/calvinowens
	a.out   5842 calvinowens  rtd    DIR  254,1     4096       2 /
	a.out   5842 calvinowens  txt    REG  254,1     7640 3408268 /home/calvinowens/a.out
	a.out   5842 calvinowens  mem    REG  254,1  1729984 4456767 /lib/x86_64-linux-gnu/libc-2.19.so
	a.out   5842 calvinowens  mem    REG  254,1   140928 4456619 /lib/x86_64-linux-gnu/ld-2.19.so
	a.out   5842 calvinowens  DEL    REG   0,32           184946 /tmp/DELETEDFILE
	a.out   5842 calvinowens    0u   CHR  136,2      0t0       5 /dev/pts/2
	a.out   5842 calvinowens    1u   CHR  136,2      0t0       5 /dev/pts/2
	a.out   5842 calvinowens    2u   CHR  136,2      0t0       5 /dev/pts/2

Notice the gap under "SIZE/OFF" in the 2nd output? This is because lsof
has no possible way to actually determine the leaked file's size.
That's the functionality "hole" I'm trying to fill with this patch.
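
For illustration, a rough sketch of how a tool could fill that gap once
map_files/ is available: walk /proc/<pid>/map_files, follow each symlink
with fstatat(), and add up the space of every inode whose link count has
dropped to zero. The function names are invented for this sketch, and it
assumes the caller is privileged enough to follow the links; entries that
cannot be followed are simply skipped.

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* On Linux, st_blocks counts 512-byte units. */
uint64_t blocks_to_bytes(uint64_t st_blocks)
{
	return st_blocks * 512;
}

/*
 * Sum the disk space held by unlinked files mapped by one process.
 * Returns 0 and stores the total in *out, or -1 if map_files/ cannot
 * be opened.
 */
int deleted_mapped_bytes(pid_t pid, uint64_t *out)
{
	char path[64];
	struct dirent *de;
	struct stat st;
	uint64_t total = 0;
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%ld/map_files", (long)pid);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		/* flags == 0: follow the symlink to the mapped inode */
		if (fstatat(dirfd(dir), de->d_name, &st, 0) == -1)
			continue;	/* not permitted, or the mapping went away */
		if (st.st_nlink == 0)	/* no name left in any directory */
			total += blocks_to_bytes((uint64_t)st.st_blocks);
	}
	closedir(dir);
	*out = total;
	return 0;
}
```

A file mapped at several address ranges appears once per range, so a
real tool would want to de-duplicate on (st_dev, st_ino) before summing.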

Does that all seem sensible?

Thanks,
Calvin

--
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <limits.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

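int main(void)
{
	int ret, fd;
	void *map;

	/* Create (or truncate) the file that will back the mapping */
	fd = open("/tmp/DELETEDFILE", O_CREAT|O_TRUNC|O_RDWR, 0777);
	if (fd == -1)
		return -1;

	/* Give the file a nonzero size so there is something to report */
	ret = ftruncate(fd, 32768);
	if (ret == -1)
		return -1;

	/* Map the first page; the mapping holds its own reference to the file */
	map = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
			fd, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Drop the descriptor: from here on only the mapping pins the file */
	close(fd);
	#ifdef DO_UNLINK
	/* Remove the name, so the file is reachable only through the mapping */
	unlink("/tmp/DELETEDFILE");
	#endif

	/* Sleep forever so the mapping can be inspected from another shell */
	printf("Holding mapping at %p\n", map);
	while (1)
		sleep(UINT_MAX);
}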


* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-10  1:39                       ` Calvin Owens
@ 2015-06-10 20:58                         ` Andrew Morton
  2015-06-11 11:10                           ` Alexey Dobriyan
  0 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-06-10 20:58 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Alexey Dobriyan, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	linux-kernel, kernel-team, Andy Lutomirski, Cyrill Gorcunov,
	Kees Cook, Kirill A. Shutemov

On Tue, 9 Jun 2015 18:39:02 -0700 Calvin Owens <calvinowens@fb.com> wrote:

> On Tuesday 06/09 at 14:13 -0700, Andrew Morton wrote:
> > On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@fb.com> wrote:
> > 
> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> > > 
> > > This interface is very useful because it allows userspace to stat()
> > > deleted files that are still mapped by some process, which enables a
> > > much quicker and more accurate answer to the question "How much disk
> > > space is being consumed by files that are deleted but still mapped?"
> > > than is currently possible.
> > 
> > Why is that information useful?
> > 
> > I could perhaps think of some use for "How much disk space is being
> > consumed by files that are deleted but still open", but to count the
> > mmapped-then-unlinked files while excluding the opened-then-unlinked
> > files seems damned peculiar.
> 
> Let's phrase the question a bit more generically:
> 
> "How much disk space is being consumed by files that have been
> unlinked, but are still referenced by some process?"
> 
> There are two pieces to this problem:
> 	1) Unlinked files that are still open (whether mapped or not)
> 	2) Unlinked files that are not open, but are still mapped
> 
> You can track down everything in (1) using /proc/<pid>/fd/*, and you
> can use stat() to figure out how much space they're using.

This doesn't work if the mapped file has been unlinked?  What does the
/proc/pid/map_files listing look like for these?

> Does that all seem sensible?

Spose so.  Please capture all this info in the changelog.


It all seems a bit awkward though.  If we want to know "how much disk
space is this process using" (or similar) then I wonder what a syscall
(or prctl mode?) which does this would look like.



* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-10 20:58                         ` Andrew Morton
@ 2015-06-11 11:10                           ` Alexey Dobriyan
  2015-06-11 18:49                             ` Andrew Morton
  0 siblings, 1 reply; 80+ messages in thread
From: Alexey Dobriyan @ 2015-06-11 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Calvin Owens, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	Linux Kernel, kernel-team, Andy Lutomirski, Cyrill Gorcunov,
	Kees Cook, Kirill A. Shutemov

On Wed, Jun 10, 2015 at 11:58 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 9 Jun 2015 18:39:02 -0700 Calvin Owens <calvinowens@fb.com> wrote:
>
>> On Tuesday 06/09 at 14:13 -0700, Andrew Morton wrote:
>> > On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@fb.com> wrote:
>> >
>> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
>> > >
>> > > This interface is very useful because it allows userspace to stat()
>> > > deleted files that are still mapped by some process, which enables a
>> > > much quicker and more accurate answer to the question "How much disk
>> > > space is being consumed by files that are deleted but still mapped?"
>> > > than is currently possible.
>> >
>> > Why is that information useful?
>> >
>> > I could perhaps think of some use for "How much disk space is being
>> > consumed by files that are deleted but still open", but to count the
>> > mmapped-then-unlinked files while excluding the opened-then-unlinked
>> > files seems damned peculiar.
>>
>> Let's phrase the question a bit more generically:
>>
>> "How much disk space is being consumed by files that have been
>> unlinked, but are still referenced by some process?"
>>
>> There are two pieces to this problem:
>>       1) Unlinked files that are still open (whether mapped or not)
>>       2) Unlinked files that are not open, but are still mapped
>>
>> You can track down everything in (1) using /proc/<pid>/fd/*, and you
>> can use stat() to figure out how much space they're using.
>
> This doesn't work if the mapped file has been unlinked?  What does the
> /proc/pid/map_files listing look like for these?

It says "(deleted)" like /proc/*/exe or any other symlink.

>> Does that all seem sensible?
>
> Spose so.  Please capture all this info in the changelog.
>
>
> It all seems a bit awkward though.  If we want to know "how much disk
> space is this process using" (or similar) then I wonder what a syscall
> (or prctl mode?) which does this would look like.

I believe something like this is needed for checkpointing,
otherwise mmapped but unlinked files could not be restored fully
(how do you reach them?).


* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-11 11:10                           ` Alexey Dobriyan
@ 2015-06-11 18:49                             ` Andrew Morton
  2015-06-12  9:55                               ` Alexey Dobriyan
  0 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-06-11 18:49 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Calvin Owens, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	Linux Kernel, kernel-team, Andy Lutomirski, Cyrill Gorcunov,
	Kees Cook, Kirill A. Shutemov

On Thu, 11 Jun 2015 14:10:45 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:

> On Wed, Jun 10, 2015 at 11:58 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Tue, 9 Jun 2015 18:39:02 -0700 Calvin Owens <calvinowens@fb.com> wrote:
> >
> >> On Tuesday 06/09 at 14:13 -0700, Andrew Morton wrote:
> >> > On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@fb.com> wrote:
> >> >
> >> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> >> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> >> > >
> >> > > This interface is very useful because it allows userspace to stat()
> >> > > deleted files that are still mapped by some process, which enables a
> >> > > much quicker and more accurate answer to the question "How much disk
> >> > > space is being consumed by files that are deleted but still mapped?"
> >> > > than is currently possible.
> >> >
> >> > Why is that information useful?
> >> >
> >> > I could perhaps think of some use for "How much disk space is being
> >> > consumed by files that are deleted but still open", but to count the
> >> > mmapped-then-unlinked files while excluding the opened-then-unlinked
> >> > files seems damned peculiar.
> >>
> >> Let's phrase the question a bit more generically:
> >>
> >> "How much disk space is being consumed by files that have been
> >> unlinked, but are still referenced by some process?"
> >>
> >> There are two pieces to this problem:
> >>       1) Unlinked files that are still open (whether mapped or not)
> >>       2) Unlinked files that are not open, but are still mapped
> >>
> >> You can track down everything in (1) using /proc/<pid>/fd/*, and you
> >> can use stat() to figure out how much space they're using.
> >
> > This doesn't work if the mapped file has been unlinked?  What does the
> > /proc/pid/map_files listing look like for these?
> 
> It says "(deleted)" like /proc/*/exe or any other symlink.

Actually the symlink directs at "/home/akpm/foo (deleted)".

And lo, if you do `stat -L' on the symlink, you get the info for the
unlinked-but-still-mmapped inode.  I never knew that.  And I wouldn't
have learned it from the documentation, which is careful to keep all
this a secret.


* Re: [PATCH v6] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-11 18:49                             ` Andrew Morton
@ 2015-06-12  9:55                               ` Alexey Dobriyan
  0 siblings, 0 replies; 80+ messages in thread
From: Alexey Dobriyan @ 2015-06-12  9:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Calvin Owens, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	Linux Kernel, kernel-team, Andy Lutomirski, Cyrill Gorcunov,
	Kees Cook, Kirill A. Shutemov

On Thu, Jun 11, 2015 at 9:49 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 11 Jun 2015 14:10:45 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
>> On Wed, Jun 10, 2015 at 11:58 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>> > On Tue, 9 Jun 2015 18:39:02 -0700 Calvin Owens <calvinowens@fb.com> wrote:
>> >
>> >> On Tuesday 06/09 at 14:13 -0700, Andrew Morton wrote:
>> >> > On Mon, 8 Jun 2015 20:39:33 -0700 Calvin Owens <calvinowens@fb.com> wrote:
>> >> >
>> >> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> >> > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
>> >> > >
>> >> > > This interface is very useful because it allows userspace to stat()
>> >> > > deleted files that are still mapped by some process, which enables a
>> >> > > much quicker and more accurate answer to the question "How much disk
>> >> > > space is being consumed by files that are deleted but still mapped?"
>> >> > > than is currently possible.
>> >> >
>> >> > Why is that information useful?
>> >> >
>> >> > I could perhaps think of some use for "How much disk space is being
>> >> > consumed by files that are deleted but still open", but to count the
>> >> > mmapped-then-unlinked files while excluding the opened-then-unlinked
>> >> > files seems damned peculiar.
>> >>
>> >> Let's phrase the question a bit more generically:
>> >>
>> >> "How much disk space is being consumed by files that have been
>> >> unlinked, but are still referenced by some process?"
>> >>
>> >> There are two pieces to this problem:
>> >>       1) Unlinked files that are still open (whether mapped or not)
>> >>       2) Unlinked files that are not open, but are still mapped
>> >>
>> >> You can track down everything in (1) using /proc/<pid>/fd/*, and you
>> >> can use stat() to figure out how much space they're using.
>> >
>> > This doesn't work if the mapped file has been unlinked?  What does the
>> > /proc/pid/map_files listing look like for these?
>>
>> It says "(deleted)" like /proc/*/exe or any other symlink.
>
> Actually the symlink directs at "/home/akpm/foo (deleted)".

I meant exactly that: full path + (deleted).

> And lo, if you do `stat -L' on the symlink, you get the info for the
> unlinked-but-still-mmapped inode.  I never knew that.  And I wouldn't
> have learned it from the documentation, which is careful to keep all
> this a secret.

Yes, map_files symlinks let you reach the mapped files in the very same
way /proc/*/fd symlinks let you reach open descriptors.


* [PATCH v7] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-09  3:39                   ` [PATCH v6] " Calvin Owens
  2015-06-09 17:27                     ` Kees Cook
  2015-06-09 21:13                     ` Andrew Morton
@ 2015-06-19  2:32                     ` Calvin Owens
  2015-07-15 22:21                       ` Andrew Morton
  2 siblings, 1 reply; 80+ messages in thread
From: Calvin Owens @ 2015-06-19  2:32 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Eric W. Biederman, Al Viro,
	Miklos Szeredi, Zefan Li, Oleg Nesterov, Joe Perches,
	David Howells
  Cc: linux-kernel, kernel-team, calvinowens, keescook,
	Andy Lutomirski, Cyrill Gorcunov, Kirill A. Shutemov

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
is only exposed if CONFIG_CHECKPOINT_RESTORE is set.

Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow
them to the backing file even if that backing file has been unlinked.
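
As an illustration of the naming scheme, each entry name has the form
"<start>-<end>" in hexadecimal, so userspace can recover the address
range from the name alone. Below is a minimal parsing sketch; the
function name is made up, and the parsing is more lenient than the
kernel's own dname_to_vma_addr() (strtoul() tolerates a leading sign or
"0x" prefix, which are not rejected here).

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a map_files/ entry name of the form "<start>-<end>" (hex).
 * Returns 0 on success, -1 if the name is malformed or the range is
 * empty or inverted.
 */
int parse_map_files_name(const char *name,
			 unsigned long *start, unsigned long *end)
{
	char *p;

	errno = 0;
	*start = strtoul(name, &p, 16);
	if (p == name || *p != '-' || errno)
		return -1;

	name = p + 1;
	*end = strtoul(name, &p, 16);
	if (p == name || *p != '\0' || errno)
		return -1;

	return *start < *end ? 0 : -1;
}
```

For example, the name "400000-401000" would describe a single 4096-byte
mapping starting at 0x400000.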

Currently, files which are mapped, unlinked, and closed are impossible
to stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
functionality "hole".

Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your
filesystem becomes large enough.

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
and adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

	Remove the CAP_SYS_ADMIN restriction, leaving only the current
	restriction requiring PTRACE_MODE_READ. The information made
	available to userspace by these three functions is already
	available in /proc/<pid>/maps with MODE_READ, so I don't see any
	reason to limit them any further (see below for more detail).

* proc_map_files_follow_link()

	This stub has been added, and requires that the user have
	CAP_SYS_ADMIN in order to follow the links in map_files/,
	since there was concern on LKML both about the potential for
	bypassing permissions on ancestor directories in the path to
	files pointed to, and about what happens with more exotic
	memory mappings created by some drivers (e.g. dma-buf).

In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
Changes in v7:	Better commit message (hopefully), patch is otherwise
		identical to v6.

Changes in v6:	Require CAP_SYS_ADMIN for follow_link(). Leave other
		PTRACE_MODE_READ checks as-is, since CAP_SYS_ADMIN
		alone addresses all concerns raised AFAICS.

Changes in v5:	s/dentry->d_inode/d_inode(dentry)/g

Changes in v4:	Return -ESRCH from follow_link() when get_proc_task()
		returns NULL.

Changes in v3:	Changed permission checks to use PTRACE_MODE_ATTACH
		instead of PTRACE_MODE_READ, and added a stub to
		enforce MODE_ATTACH on follow_link() as well.

Changes in v2:	Removed the follow_link() stub that returned -EPERM if
		the caller didn't have CAP_SYS_ADMIN, since the caller
		in my chroot() scenario gets -EACCES anyway.


 fs/proc/base.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 093ca14..0270191 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1641,8 +1641,6 @@ end_instantiate:
 	return dir_emit(ctx, name, len, 1, DT_UNKNOWN);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1669,11 +1667,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = d_inode(dentry);
 	task = get_proc_task(inode);
 	if (!task)
@@ -1762,6 +1755,28 @@ struct map_files_info {
 	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
+ * symlinks may be used to bypass permissions on ancestor directories in the
+ * path to the file in question.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations proc_map_files_link_inode_operations = {
+	.readlink	= proc_pid_readlink,
+	.follow_link	= proc_map_files_follow_link,
+	.setattr	= proc_setattr,
+};
+
 static int
 proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 			   struct task_struct *task, const void *ptr)
@@ -1777,7 +1792,7 @@ proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
 	ei = PROC_I(inode);
 	ei->op.proc_get_link = proc_map_files_get_link;
 
-	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
 	inode->i_mode = S_IFLNK;
 
@@ -1801,10 +1816,6 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 	int result;
 	struct mm_struct *mm;
 
-	result = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	result = -ENOENT;
 	task = get_proc_task(dir);
 	if (!task)
@@ -1858,10 +1869,6 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	struct map_files_info *p;
 	int ret;
 
-	ret = -EPERM;
-	if (!capable(CAP_SYS_ADMIN))
-		goto out;
-
 	ret = -ENOENT;
 	task = get_proc_task(file_inode(file));
 	if (!task)
@@ -2050,7 +2057,6 @@ static const struct file_operations proc_timers_operations = {
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
-#endif /* CONFIG_CHECKPOINT_RESTORE */
 
 static int proc_pident_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
@@ -2549,9 +2555,7 @@ static const struct inode_operations proc_task_inode_operations;
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-#ifdef CONFIG_CHECKPOINT_RESTORE
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-#endif
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
-- 
1.8.1



* Re: [PATCH v7] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-06-19  2:32                     ` [PATCH v7] " Calvin Owens
@ 2015-07-15 22:21                       ` Andrew Morton
  2015-07-15 23:39                         ` Calvin Owens
  0 siblings, 1 reply; 80+ messages in thread
From: Andrew Morton @ 2015-07-15 22:21 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Alexey Dobriyan, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	linux-kernel, kernel-team, keescook, Andy Lutomirski,
	Cyrill Gorcunov, Kirill A. Shutemov

On Thu, 18 Jun 2015 19:32:18 -0700 Calvin Owens <calvinowens@fb.com> wrote:

> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> 
> Each mapped file region gets a symlink in /proc/<pid>/map_files/
> corresponding to the virtual address range at which it is mapped. The
> symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow
> them to the backing file even if that backing file has been unlinked.
> 
> Currently, files which are mapped, unlinked, and closed are impossible
> to stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
> functionality "hole".
> 
> Not being able to stat() such files makes noticing and explicitly
> accounting for the space they use on the filesystem impossible. You can
> work around this by summing up the space used by every file in the
> filesystem and subtracting that total from what statfs() tells you, but
> that obviously isn't great, and it becomes unworkable once your
> filesystem becomes large enough.
> 
> This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
> and adjusts the permissions enforced on it as follows:

proc_pid_follow_link() got changed while you weren't looking, causing

fs/proc/base.c: In function 'proc_map_files_follow_link':
fs/proc/base.c:1963: warning: passing argument 2 of 'proc_pid_follow_link' from incompatible pointer type
fs/proc/base.c:1578: note: expected 'void **' but argument is of type 'struct nameidata *'
fs/proc/base.c:1963: warning: return discards qualifiers from pointer target type
fs/proc/base.c: At top level:
fs/proc/base.c:1971: warning: initialization from incompatible pointer type

I just changed it to pass NULL:

--- a/fs/proc/base.c~procfs-always-expose-proc-pid-map_files-and-make-it-readable-fix
+++ a/fs/proc/base.c
@@ -1955,12 +1955,13 @@ struct map_files_info {
  * symlinks may be used to bypass permissions on ancestor directories in the
  * path to the file in question.
  */
-static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *
+proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	if (!capable(CAP_SYS_ADMIN))
 		return ERR_PTR(-EPERM);
 
-	return proc_pid_follow_link(dentry, nd);
+	return proc_pid_follow_link(dentry, NULL);
 }
 
 /*
_


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v7] procfs: Always expose /proc/<pid>/map_files/ and make it readable
  2015-07-15 22:21                       ` Andrew Morton
@ 2015-07-15 23:39                         ` Calvin Owens
  0 siblings, 0 replies; 80+ messages in thread
From: Calvin Owens @ 2015-07-15 23:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexey Dobriyan, Eric W. Biederman, Al Viro, Miklos Szeredi,
	Zefan Li, Oleg Nesterov, Joe Perches, David Howells,
	linux-kernel, kernel-team, keescook, Andy Lutomirski,
	Cyrill Gorcunov, Kirill A. Shutemov

On Wednesday 07/15 at 15:21 -0700, Andrew Morton wrote:
> On Thu, 18 Jun 2015 19:32:18 -0700 Calvin Owens <calvinowens@fb.com> wrote:
> 
> > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
> > is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> > 
> > Each mapped file region gets a symlink in /proc/<pid>/map_files/
> > corresponding to the virtual address range at which it is mapped. The
> > symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow
> > them to the backing file even if that backing file has been unlinked.
> > 
> > Currently, files which are mapped, unlinked, and closed are impossible
> > to stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
> > functionality "hole".
> > 
> > Not being able to stat() such files makes noticing and explicitly
> > accounting for the space they use on the filesystem impossible. You can
> > work around this by summing up the space used by every file in the
> > filesystem and subtracting that total from what statfs() tells you, but
> > that obviously isn't great, and it becomes unworkable once your
> > filesystem becomes large enough.
> > 
> > This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
> > and adjusts the permissions enforced on it as follows:
> 
> proc_pid_follow_link() got changed while you weren't looking, causing
> 
> fs/proc/base.c: In function 'proc_map_files_follow_link':
> fs/proc/base.c:1963: warning: passing argument 2 of 'proc_pid_follow_link' from incompatible pointer type
> fs/proc/base.c:1578: note: expected 'void **' but argument is of type 'struct nameidata *'
> fs/proc/base.c:1963: warning: return discards qualifiers from pointer target type
> fs/proc/base.c: At top level:
> fs/proc/base.c:1971: warning: initialization from incompatible pointer type
> 
> I just changed it to pass NULL:

Thanks for cleaning this up, I'll make sure to check outstanding patches
against new -rcs and -nexts in the future.

Thanks,
Calvin

> --- a/fs/proc/base.c~procfs-always-expose-proc-pid-map_files-and-make-it-readable-fix
> +++ a/fs/proc/base.c
> @@ -1955,12 +1955,13 @@ struct map_files_info {
>   * symlinks may be used to bypass permissions on ancestor directories in the
>   * path to the file in question.
>   */
> -static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
> +static void *
> +proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
>  {
>  	if (!capable(CAP_SYS_ADMIN))
>  		return ERR_PTR(-EPERM);
>  
> -	return proc_pid_follow_link(dentry, nd);
> +	return proc_pid_follow_link(dentry, NULL);
>  }
>  
>  /*
> _
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2015-07-15 23:40 UTC | newest]

Thread overview: 80+ messages
2015-01-14  0:20 [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Calvin Owens
2015-01-14  0:23 ` Calvin Owens
2015-01-14 14:13 ` Rasmus Villemoes
2015-01-14 14:37   ` Siddhesh Poyarekar
2015-01-14 14:53     ` Rasmus Villemoes
2015-01-14 21:03       ` Calvin Owens
2015-01-14 22:45         ` Andrew Morton
2015-01-14 23:51           ` Rasmus Villemoes
2015-01-16  1:15             ` Andrew Morton
2015-01-16 11:00               ` Kirill A. Shutemov
2015-01-14 15:25 ` Kirill A. Shutemov
2015-01-14 15:33   ` Cyrill Gorcunov
2015-01-14 20:46     ` Calvin Owens
2015-01-14 21:16       ` Cyrill Gorcunov
2015-01-22  2:45         ` [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable Calvin Owens
2015-01-22  7:16           ` Cyrill Gorcunov
2015-01-22 11:02           ` Kirill A. Shutemov
2015-01-22 21:00             ` Calvin Owens
2015-01-22 21:27               ` Kirill A. Shutemov
2015-01-23  5:52                 ` Calvin Owens
2015-01-24  3:15           ` [RFC][PATCH v2] " Calvin Owens
2015-01-26 12:47             ` Kirill A. Shutemov
2015-01-26 21:00               ` Cyrill Gorcunov
2015-01-26 21:00                 ` Cyrill Gorcunov
2015-01-26 23:43                 ` Andrew Morton
2015-01-27  0:15                   ` Kees Cook
2015-01-27  0:15                     ` Kees Cook
2015-01-27  7:37                     ` Cyrill Gorcunov
2015-01-27  7:37                       ` Cyrill Gorcunov
2015-01-27 19:53                       ` Kees Cook
2015-01-27 19:53                         ` Kees Cook
2015-01-27 21:35                         ` Cyrill Gorcunov
2015-01-27 21:35                           ` Cyrill Gorcunov
2015-01-27 21:46                         ` Pavel Emelyanov
2015-01-27 21:46                           ` Pavel Emelyanov
2015-01-27  0:19                   ` Kirill A. Shutemov
2015-01-27  0:19                     ` Kirill A. Shutemov
2015-01-27  6:46                   ` Cyrill Gorcunov
2015-01-27  6:46                     ` Cyrill Gorcunov
2015-01-27  6:50                     ` Andrew Morton
2015-01-27  7:23                       ` Cyrill Gorcunov
2015-01-27  7:23                         ` Cyrill Gorcunov
2015-01-28  4:38                   ` Calvin Owens
2015-01-28  4:38                     ` Calvin Owens
2015-01-30  1:30                     ` Kees Cook
2015-01-30  1:30                       ` Kees Cook
2015-01-31  1:58                       ` Calvin Owens
2015-01-31  1:58                         ` Calvin Owens
2015-02-02 14:01                         ` Austin S Hemmelgarn
2015-02-04  3:53                           ` Calvin Owens
2015-02-04  3:53                             ` Calvin Owens
2015-02-02 20:16                         ` Andy Lutomirski
2015-02-04  3:28                           ` Calvin Owens
2015-02-04  3:28                             ` Calvin Owens
2015-02-12  2:29             ` [RFC][PATCH v3] " Calvin Owens
2015-02-12  7:45               ` Cyrill Gorcunov
2015-02-14 20:40               ` [RFC][PATCH v4] " Calvin Owens
2015-03-10 22:17                 ` Cyrill Gorcunov
2015-04-28 22:23                   ` Calvin Owens
2015-04-29  7:32                     ` Cyrill Gorcunov
2015-05-19  3:10                 ` [PATCH v5] " Calvin Owens
2015-05-19  3:29                   ` Joe Perches
2015-05-19 18:04                   ` Andy Lutomirski
2015-05-21  1:52                     ` Calvin Owens
2015-05-21  2:10                       ` Andy Lutomirski
2015-06-09  3:39                   ` [PATCH v6] " Calvin Owens
2015-06-09 17:27                     ` Kees Cook
2015-06-09 17:47                       ` Andy Lutomirski
2015-06-09 18:15                         ` Cyrill Gorcunov
2015-06-09 21:13                     ` Andrew Morton
2015-06-10  1:39                       ` Calvin Owens
2015-06-10 20:58                         ` Andrew Morton
2015-06-11 11:10                           ` Alexey Dobriyan
2015-06-11 18:49                             ` Andrew Morton
2015-06-12  9:55                               ` Alexey Dobriyan
2015-06-19  2:32                     ` [PATCH v7] " Calvin Owens
2015-07-15 22:21                       ` Andrew Morton
2015-07-15 23:39                         ` Calvin Owens
2015-02-14 20:44               ` [PATCH] procfs: Return -ESRCH on /proc/N/fd/* when PID N doesn't exist Calvin Owens
2015-01-14 22:40 ` [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files Kirill A. Shutemov
