All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups
@ 2014-08-05 19:46 Oleg Nesterov
  2014-08-05 19:46 ` [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map() Oleg Nesterov
                   ` (12 more replies)
  0 siblings, 13 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Changes:

	1-4: Update the changelogs, join "introduce the stable
	     proc_maps_private->mm" and "change m_start() to rely on
	     priv->mm" into a single "shift "priv->task = NULL" from
	     m_start() to m_stop()".

	     Resulting code is the same. Kirill and Cyrill, you seem
	     to agree with these changes, I'll appreciate your acks.

	5-7: New. Seems to work, please review.

Todo:

	- Cleanup the tail_vma horror in m_start

	- Update task_nommu.c in the same way

	- Fix lock_trace() users

Oleg.

 fs/proc/base.c     |   36 ++++++++------
 fs/proc/internal.h |    3 +
 fs/proc/task_mmu.c |  136 ++++++++++++++++++++++++++-------------------------
 3 files changed, 93 insertions(+), 82 deletions(-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map()
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
@ 2014-08-05 19:46 ` Oleg Nesterov
  2014-08-06  9:52   ` Kirill A. Shutemov
  2014-08-05 19:46 ` [PATCH v2 2/7] fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open() Oleg Nesterov
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be
NULL or it can changed by exec right after mm_access().

And in theory this race is not harmless, the task can exec and then
later exit and free the new mm_struct. In this case get_task_mm(oldmm)
can't help, get_gate_vma(task->mm) can read the freed/unmapped memory.

I think that priv->task should simply die and hold_task_mempolicy()
logic can be simplified. tail_vma logic asks for cleanups too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cfa63ee..8a5271e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -170,7 +170,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		return mm;
 	down_read(&mm->mmap_sem);
 
-	tail_vma = get_gate_vma(priv->task->mm);
+	tail_vma = get_gate_vma(mm);
 	priv->tail_vma = tail_vma;
 	hold_task_mempolicy(priv);
 	/* Start with last addr hint */
@@ -351,12 +351,11 @@ static int show_map(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
 	struct proc_maps_private *priv = m->private;
-	struct task_struct *task = priv->task;
 
 	show_map_vma(m, vma, is_pid);
 
 	if (m->count < m->size)  /* vma is copied successfully */
-		m->version = (vma != get_gate_vma(task->mm))
+		m->version = (vma != priv->tail_vma)
 			? vma->vm_start : 0;
 	return 0;
 }
@@ -584,7 +583,6 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct proc_maps_private *priv = m->private;
-	struct task_struct *task = priv->task;
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
@@ -639,7 +637,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 	show_smap_vma_flags(m, vma);
 
 	if (m->count < m->size)  /* vma is copied successfully */
-		m->version = (vma != get_gate_vma(task->mm))
+		m->version = (vma != priv->tail_vma)
 			? vma->vm_start : 0;
 	return 0;
 }
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 2/7] fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open()
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
  2014-08-05 19:46 ` [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map() Oleg Nesterov
@ 2014-08-05 19:46 ` Oleg Nesterov
  2014-08-06  9:55   ` Kirill A. Shutemov
  2014-08-05 19:46 ` [PATCH v2 3/7] proc: introduce proc_mem_open() Oleg Nesterov
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

do_maps_open() and numa_maps_open() are overcomplicated, they could
use __seq_open_private(). Plus they do the same, just sizeof(*priv)
differs.

Change them to use a new simple helper, proc_maps_open(ops, psize).
This simplifies the code and allows us to do the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   44 ++++++++++++++++----------------------------
 1 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a5271e..794aeb6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -231,23 +231,23 @@ static void m_stop(struct seq_file *m, void *v)
 		put_task_struct(priv->task);
 }
 
+static int proc_maps_open(struct inode *inode, struct file *file,
+			const struct seq_operations *ops, int psize)
+{
+	struct proc_maps_private *priv = __seq_open_private(file, ops, psize);
+
+	if (!priv)
+		return -ENOMEM;
+
+	priv->pid = proc_pid(inode);
+	return 0;
+}
+
 static int do_maps_open(struct inode *inode, struct file *file,
 			const struct seq_operations *ops)
 {
-	struct proc_maps_private *priv;
-	int ret = -ENOMEM;
-	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
-	if (priv) {
-		priv->pid = proc_pid(inode);
-		ret = seq_open(file, ops);
-		if (!ret) {
-			struct seq_file *m = file->private_data;
-			m->private = priv;
-		} else {
-			kfree(priv);
-		}
-	}
-	return ret;
+	return proc_maps_open(inode, file, ops,
+				sizeof(struct proc_maps_private));
 }
 
 static void
@@ -1502,20 +1502,8 @@ static const struct seq_operations proc_tid_numa_maps_op = {
 static int numa_maps_open(struct inode *inode, struct file *file,
 			  const struct seq_operations *ops)
 {
-	struct numa_maps_private *priv;
-	int ret = -ENOMEM;
-	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
-	if (priv) {
-		priv->proc_maps.pid = proc_pid(inode);
-		ret = seq_open(file, ops);
-		if (!ret) {
-			struct seq_file *m = file->private_data;
-			m->private = priv;
-		} else {
-			kfree(priv);
-		}
-	}
-	return ret;
+	return proc_maps_open(inode, file, ops,
+				sizeof(struct numa_maps_private));
 }
 
 static int pid_numa_maps_open(struct inode *inode, struct file *file)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 3/7] proc: introduce proc_mem_open()
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
  2014-08-05 19:46 ` [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map() Oleg Nesterov
  2014-08-05 19:46 ` [PATCH v2 2/7] fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open() Oleg Nesterov
@ 2014-08-05 19:46 ` Oleg Nesterov
  2014-08-06  9:57   ` Kirill A. Shutemov
  2014-08-05 19:46 ` [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Extract the mm_access() code from __mem_open() into the new helper,
proc_mem_open(), the next patch will add another caller.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c     |   36 +++++++++++++++++++++---------------
 fs/proc/internal.h |    2 ++
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2d696b0..9aef9ac 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -667,29 +667,35 @@ static const struct file_operations proc_single_file_operations = {
 	.release	= single_release,
 };
 
-static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
+
+struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode)
 {
-	struct task_struct *task = get_proc_task(file_inode(file));
-	struct mm_struct *mm;
+	struct task_struct *task = get_proc_task(inode);
+	struct mm_struct *mm = ERR_PTR(-ESRCH);
 
-	if (!task)
-		return -ESRCH;
+	if (task) {
+		mm = mm_access(task, mode);
+		put_task_struct(task);
 
-	mm = mm_access(task, mode);
-	put_task_struct(task);
+		if (!IS_ERR_OR_NULL(mm)) {
+			/* ensure this mm_struct can't be freed */
+			atomic_inc(&mm->mm_count);
+			/* but do not pin its memory */
+			mmput(mm);
+		}
+	}
+
+	return mm;
+}
+
+static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
+{
+	struct mm_struct *mm = proc_mem_open(inode, mode);
 
 	if (IS_ERR(mm))
 		return PTR_ERR(mm);
 
-	if (mm) {
-		/* ensure this mm_struct can't be freed */
-		atomic_inc(&mm->mm_count);
-		/* but do not pin its memory */
-		mmput(mm);
-	}
-
 	file->private_data = mm;
-
 	return 0;
 }
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3ab6d14..ba0c1c1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -275,6 +275,8 @@ struct proc_maps_private {
 #endif
 };
 
+struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
+
 extern const struct file_operations proc_pid_maps_operations;
 extern const struct file_operations proc_tid_maps_operations;
 extern const struct file_operations proc_pid_numa_maps_operations;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (2 preceding siblings ...)
  2014-08-05 19:46 ` [PATCH v2 3/7] proc: introduce proc_mem_open() Oleg Nesterov
@ 2014-08-05 19:46 ` Oleg Nesterov
  2014-08-06 10:05   ` Kirill A. Shutemov
  2014-12-03 14:14   ` Kirill A. Shutemov
  2014-08-05 19:46 ` [PATCH v2 5/7] fs/proc/task_mmu.c: simplify the vma_stop() logic Oleg Nesterov
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

A simple test-case from Kirill Shutemov

	cat /proc/self/maps >/dev/null
	chmod +x /proc/self/net/packet
	exec /proc/self/net/packet

makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
the opposite order.

It's a false positive and probably we should not allow "chmod +x" on proc
files. Still I think that we should avoid mm_access() and cred_guard_mutex
in sys_read() paths, security checking should happen at open time. Besides,
this doesn't even look right if the task changes its ->mm between m_stop()
and m_start().

Add the new "mm_struct *mm" member into struct proc_maps_private and change
proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
otherwise.

The only complication is that proc_maps_open() users should additionally do
mmdrop() in fop->release(), add the new proc_map_release() helper for that.

Note: this is the user-visible change, if the task execs after open("maps")
the new ->mm won't be visible via this file. I hope this is fine, and this
matches /proc/pid/mem bahaviour.

Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/internal.h |    1 +
 fs/proc/task_mmu.c |   37 ++++++++++++++++++++++++++++---------
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index ba0c1c1..78784cd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -267,6 +267,7 @@ extern int proc_remount(struct super_block *, int *, char *);
 struct proc_maps_private {
 	struct pid *pid;
 	struct task_struct *task;
+	struct mm_struct *mm;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 794aeb6..7ec8eb5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -165,9 +165,9 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	if (!priv->task)
 		return ERR_PTR(-ESRCH);
 
-	mm = mm_access(priv->task, PTRACE_MODE_READ);
-	if (!mm || IS_ERR(mm))
-		return mm;
+	mm = priv->mm;
+	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+		return NULL;
 	down_read(&mm->mmap_sem);
 
 	tail_vma = get_gate_vma(mm);
@@ -240,9 +240,28 @@ static int proc_maps_open(struct inode *inode, struct file *file,
 		return -ENOMEM;
 
 	priv->pid = proc_pid(inode);
+	priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(priv->mm)) {
+		int err = PTR_ERR(priv->mm);
+		seq_release_private(inode, file);
+		return err;
+	}
+
 	return 0;
 }
 
+static int proc_map_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct proc_maps_private *priv = seq->private;
+
+	if (priv->mm)
+		mmdrop(priv->mm);
+
+	return seq_release_private(inode, file);
+}
+
+
 static int do_maps_open(struct inode *inode, struct file *file,
 			const struct seq_operations *ops)
 {
@@ -398,14 +417,14 @@ const struct file_operations proc_pid_maps_operations = {
 	.open		= pid_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= proc_map_release,
 };
 
 const struct file_operations proc_tid_maps_operations = {
 	.open		= tid_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= proc_map_release,
 };
 
 /*
@@ -680,14 +699,14 @@ const struct file_operations proc_pid_smaps_operations = {
 	.open		= pid_smaps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= proc_map_release,
 };
 
 const struct file_operations proc_tid_smaps_operations = {
 	.open		= tid_smaps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= proc_map_release,
 };
 
 /*
@@ -1520,13 +1539,13 @@ const struct file_operations proc_pid_numa_maps_operations = {
 	.open		= pid_numa_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= proc_map_release,
 };
 
 const struct file_operations proc_tid_numa_maps_operations = {
 	.open		= tid_numa_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= proc_map_release,
 };
 #endif /* CONFIG_NUMA */
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 5/7] fs/proc/task_mmu.c: simplify the vma_stop() logic
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (3 preceding siblings ...)
  2014-08-05 19:46 ` [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
@ 2014-08-05 19:46 ` Oleg Nesterov
  2014-08-06 10:12   ` Kirill A. Shutemov
  2014-08-05 19:47 ` [PATCH v2 6/7] fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next() Oleg Nesterov
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall
vma. This is because in this case m_stop()->vma_stop() obviously
can't use gate_vma->vm_mm.

Now that we have proc_maps_private->mm we can simplify this logic:

  - Change m_start() to return with ->mmap_sem held unless it returns
    IS_ERR_OR_NULL().

  - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
    this makes "vm_area_struct *vma" unnecessary.

  - This also allows m_start() to use vm_stop().

  - Cleanup m_next() to follow the new locking rule.

    Note: m_stop() looks very ugly, and this temporary uglifies it
    even more. Fixed by the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   34 ++++++++++++++++++----------------
 1 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7ec8eb5..f1254b4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -129,14 +129,12 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
 
-static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
+static void vma_stop(struct proc_maps_private *priv)
 {
-	if (vma && vma != priv->tail_vma) {
-		struct mm_struct *mm = vma->vm_mm;
-		release_task_mempolicy(priv);
-		up_read(&mm->mmap_sem);
-		mmput(mm);
-	}
+	struct mm_struct *mm = priv->mm;
+	release_task_mempolicy(priv);
+	up_read(&mm->mmap_sem);
+	mmput(mm);
 }
 
 static void *m_start(struct seq_file *m, loff_t *pos)
@@ -199,12 +197,13 @@ out:
 	if (vma)
 		return vma;
 
-	release_task_mempolicy(priv);
 	/* End of vmas has been reached */
 	m->version = (tail_vma != NULL)? 0: -1UL;
-	up_read(&mm->mmap_sem);
-	mmput(mm);
-	return tail_vma;
+	if (tail_vma)
+		return tail_vma;
+
+	vma_stop(priv);
+	return NULL;
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *pos)
@@ -212,21 +211,24 @@ static void *m_next(struct seq_file *m, void *v, loff_t *pos)
 	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = v;
 	struct vm_area_struct *tail_vma = priv->tail_vma;
+	struct vm_area_struct *next;
 
 	(*pos)++;
 	if (vma && (vma != tail_vma) && vma->vm_next)
 		return vma->vm_next;
-	vma_stop(priv, vma);
-	return (vma != tail_vma)? tail_vma: NULL;
+
+	next = (vma != tail_vma)? tail_vma: NULL;
+	if (!next)
+		vma_stop(priv);
+	return next;
 }
 
 static void m_stop(struct seq_file *m, void *v)
 {
 	struct proc_maps_private *priv = m->private;
-	struct vm_area_struct *vma = v;
 
-	if (!IS_ERR(vma))
-		vma_stop(priv, vma);
+	if (!IS_ERR_OR_NULL(v))
+		vma_stop(priv);
 	if (priv->task)
 		put_task_struct(priv->task);
 }
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 6/7] fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next()
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (4 preceding siblings ...)
  2014-08-05 19:46 ` [PATCH v2 5/7] fs/proc/task_mmu.c: simplify the vma_stop() logic Oleg Nesterov
@ 2014-08-05 19:47 ` Oleg Nesterov
  2014-08-06 10:17   ` Kirill A. Shutemov
  2014-08-05 19:47 ` [PATCH v2 7/7] fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop() Oleg Nesterov
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

1. Kill the first "vma != NULL" check. Firstly this is not possible,
   m_next() won't be called if ->start() or the previous ->next()
   returns NULL.

   And if it was possible the 2nd "vma != tail_vma" check is buggy,
   we should not wrongly return ->tail_vma.

2. Make this function readable. The logic is very simple, we should
   return check "vma != tail" once and return "vm_next || tail_vma".

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f1254b4..c4c8825 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -209,15 +209,13 @@ out:
 static void *m_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct proc_maps_private *priv = m->private;
-	struct vm_area_struct *vma = v;
 	struct vm_area_struct *tail_vma = priv->tail_vma;
-	struct vm_area_struct *next;
+	struct vm_area_struct *vma = v, *next = NULL;
 
 	(*pos)++;
-	if (vma && (vma != tail_vma) && vma->vm_next)
-		return vma->vm_next;
+	if (vma != tail_vma)
+		next = vma->vm_next ?: tail_vma;
 
-	next = (vma != tail_vma)? tail_vma: NULL;
 	if (!next)
 		vma_stop(priv);
 	return next;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 7/7] fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop()
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (5 preceding siblings ...)
  2014-08-05 19:47 ` [PATCH v2 6/7] fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next() Oleg Nesterov
@ 2014-08-05 19:47 ` Oleg Nesterov
  2014-08-06 10:18   ` Kirill A. Shutemov
  2014-08-06 10:19 ` [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Kirill A. Shutemov
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-05 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

1. There is no reason to reset ->tail_vma in m_start(), if we return
   IS_ERR_OR_NULL() it won't be used.

2. m_start() also clears priv->task to ensure that m_stop() won't use
   the stale pointer if we fail before get_task_struct(). But this is
   ugly and confusing, move this initialization in m_stop().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c4c8825..5f7fb45 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -145,17 +145,12 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	struct vm_area_struct *vma, *tail_vma = NULL;
 	loff_t l = *pos;
 
-	/* Clear the per syscall fields in priv */
-	priv->task = NULL;
-	priv->tail_vma = NULL;
-
 	/*
 	 * We remember last_addr rather than next_addr to hit with
 	 * vmacache most of the time. We have zero last_addr at
 	 * the beginning and also after lseek. We will have -1 last_addr
 	 * after the end of the vmas.
 	 */
-
 	if (last_addr == -1UL)
 		return NULL;
 
@@ -227,8 +222,10 @@ static void m_stop(struct seq_file *m, void *v)
 
 	if (!IS_ERR_OR_NULL(v))
 		vma_stop(priv);
-	if (priv->task)
+	if (priv->task) {
 		put_task_struct(priv->task);
+		priv->task = NULL;
+	}
 }
 
 static int proc_maps_open(struct inode *inode, struct file *file,
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map()
  2014-08-05 19:46 ` [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map() Oleg Nesterov
@ 2014-08-06  9:52   ` Kirill A. Shutemov
  2014-08-06 14:10     ` Oleg Nesterov
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06  9:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:44PM +0200, Oleg Nesterov wrote:
> get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be
> NULL or it can changed by exec right after mm_access().
> 
> And in theory this race is not harmless, the task can exec and then
> later exit and free the new mm_struct. In this case get_task_mm(oldmm)
> can't help, get_gate_vma(task->mm) can read the freed/unmapped memory.
> 
> I think that priv->task should simply die and hold_task_mempolicy()
> logic can be simplified. tail_vma logic asks for cleanups too.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/proc/task_mmu.c |    8 +++-----
>  1 files changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index cfa63ee..8a5271e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -170,7 +170,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>  		return mm;
>  	down_read(&mm->mmap_sem);
>  
> -	tail_vma = get_gate_vma(priv->task->mm);
> +	tail_vma = get_gate_vma(mm);
>  	priv->tail_vma = tail_vma;
>  	hold_task_mempolicy(priv);
>  	/* Start with last addr hint */
> @@ -351,12 +351,11 @@ static int show_map(struct seq_file *m, void *v, int is_pid)
>  {
>  	struct vm_area_struct *vma = v;
>  	struct proc_maps_private *priv = m->private;
> -	struct task_struct *task = priv->task;
>  
>  	show_map_vma(m, vma, is_pid);
>  
>  	if (m->count < m->size)  /* vma is copied successfully */
> -		m->version = (vma != get_gate_vma(task->mm))
> +		m->version = (vma != priv->tail_vma)
>  			? vma->vm_start : 0;

Drop excessive parenthesis while you're there? And in the next hunk.

Otherwise:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

>  	return 0;
>  }
> @@ -584,7 +583,6 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>  	struct proc_maps_private *priv = m->private;
> -	struct task_struct *task = priv->task;
>  	struct vm_area_struct *vma = v;
>  	struct mem_size_stats mss;
>  	struct mm_walk smaps_walk = {
> @@ -639,7 +637,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  	show_smap_vma_flags(m, vma);
>  
>  	if (m->count < m->size)  /* vma is copied successfully */
> -		m->version = (vma != get_gate_vma(task->mm))
> +		m->version = (vma != priv->tail_vma)
>  			? vma->vm_start : 0;
>  	return 0;
>  }
> -- 
> 1.5.5.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/7] fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open()
  2014-08-05 19:46 ` [PATCH v2 2/7] fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open() Oleg Nesterov
@ 2014-08-06  9:55   ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06  9:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:48PM +0200, Oleg Nesterov wrote:
> do_maps_open() and numa_maps_open() are overcomplicated, they could
> use __seq_open_private(). Plus they do the same, just sizeof(*priv)
> differs.
> 
> Change them to use a new simple helper, proc_maps_open(ops, psize).
> This simplifies the code and allows us to do the next changes.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 3/7] proc: introduce proc_mem_open()
  2014-08-05 19:46 ` [PATCH v2 3/7] proc: introduce proc_mem_open() Oleg Nesterov
@ 2014-08-06  9:57   ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06  9:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:52PM +0200, Oleg Nesterov wrote:
> Extract the mm_access() code from __mem_open() into the new helper,
> proc_mem_open(), the next patch will add another caller.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-08-05 19:46 ` [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
@ 2014-08-06 10:05   ` Kirill A. Shutemov
  2014-12-03 14:14   ` Kirill A. Shutemov
  1 sibling, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06 10:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:55PM +0200, Oleg Nesterov wrote:
> A simple test-case from Kirill Shutemov
> 
> 	cat /proc/self/maps >/dev/null
> 	chmod +x /proc/self/net/packet
> 	exec /proc/self/net/packet
> 
> makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
> the opposite order.
> 
> It's a false positive and probably we should not allow "chmod +x" on proc
> files. Still I think that we should avoid mm_access() and cred_guard_mutex
> in sys_read() paths, security checking should happen at open time. Besides,
> this doesn't even look right if the task changes its ->mm between m_stop()
> and m_start().
> 
> Add the new "mm_struct *mm" member into struct proc_maps_private and change
> proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
> use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
> otherwise.
> 
> The only complication is that proc_maps_open() users should additionally do
> mmdrop() in fop->release(), add the new proc_map_release() helper for that.
> 
> Note: this is the user-visible change, if the task execs after open("maps")
> the new ->mm won't be visible via this file. I hope this is fine, and this
> matches /proc/pid/mem bahaviour.
> 
> Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/7] fs/proc/task_mmu.c: simplify the vma_stop() logic
  2014-08-05 19:46 ` [PATCH v2 5/7] fs/proc/task_mmu.c: simplify the vma_stop() logic Oleg Nesterov
@ 2014-08-06 10:12   ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06 10:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:59PM +0200, Oleg Nesterov wrote:
> m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall
> vma. This is because in this case m_stop()->vma_stop() obviously
> can't use gate_vma->vm_mm.
> 
> Now that we have proc_maps_private->mm we can simplify this logic:
> 
>   - Change m_start() to return with ->mmap_sem held unless it returns
>     IS_ERR_OR_NULL().
> 
>   - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
>     this makes "vm_area_struct *vma" unnecessary.
> 
>   - This also allows m_start() to use vm_stop().
> 
>   - Cleanup m_next() to follow the new locking rule.
> 
>     Note: m_stop() looks very ugly, and this temporary uglifies it
>     even more. Fixed by the next change.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/7] fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next()
  2014-08-05 19:47 ` [PATCH v2 6/7] fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next() Oleg Nesterov
@ 2014-08-06 10:17   ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06 10:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:47:03PM +0200, Oleg Nesterov wrote:
> 1. Kill the first "vma != NULL" check. Firstly this is not possible,
>    m_next() won't be called if ->start() or the previous ->next()
>    returns NULL.
> 
>    And if it was possible the 2nd "vma != tail_vma" check is buggy,
>    we should not wrongly return ->tail_vma.
> 
> 2. Make this function readable. The logic is very simple, we should
>    return check "vma != tail" once and return "vm_next || tail_vma".
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/7] fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop()
  2014-08-05 19:47 ` [PATCH v2 7/7] fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop() Oleg Nesterov
@ 2014-08-06 10:18   ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06 10:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:47:07PM +0200, Oleg Nesterov wrote:
> 1. There is no reason to reset ->tail_vma in m_start(), if we return
>    IS_ERR_OR_NULL() it won't be used.
> 
> 2. m_start() also clears priv->task to ensure that m_stop() won't use
>    the stale pointer if we fail before get_task_struct(). But this is
>    ugly and confusing, move this initialization in m_stop().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (6 preceding siblings ...)
  2014-08-05 19:47 ` [PATCH v2 7/7] fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop() Oleg Nesterov
@ 2014-08-06 10:19 ` Kirill A. Shutemov
  2014-08-06 18:48 ` Cyrill Gorcunov
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-08-06 10:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:28PM +0200, Oleg Nesterov wrote:
> Changes:
> 
> 	1-4: Update the changelogs, join "introduce the stable
> 	     proc_maps_private->mm" and "change m_start() to rely on
> 	     priv->mm" into a single "shift "priv->task = NULL" from
> 	     m_start() to m_stop()".
> 
> 	     Resulting code is the same. Kirill and Cyrill, you seem
> 	     to agree with these changes, I'll appreciate your acks.
> 
> 	5-7: New. Seems to work, please review.

Thanks! Looks great!

Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map()
  2014-08-06  9:52   ` Kirill A. Shutemov
@ 2014-08-06 14:10     ` Oleg Nesterov
  2014-08-06 19:41       ` Oleg Nesterov
  0 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-06 14:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On 08/06, Kirill A. Shutemov wrote:
>
> On Tue, Aug 05, 2014 at 09:46:44PM +0200, Oleg Nesterov wrote:
> >
> >  	if (m->count < m->size)  /* vma is copied successfully */
> > -		m->version = (vma != get_gate_vma(task->mm))
> > +		m->version = (vma != priv->tail_vma)
> >  			? vma->vm_start : 0;
>
> Drop excessive parenthesis while you're there? And in the next hunk.

I agree. But lets do this in another patch? I forgot to mention this in
the yesterday's TODO, but we should factor out this code. Perhaps move
this into vma_stop(). Or at least add a trivial helper, this pattern
repeats 3 times, plus another one in m_start(). This problem is that I
do not really understand this logic right now, but I'll do something
today in any case.

> Otherwise:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (7 preceding siblings ...)
  2014-08-06 10:19 ` [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Kirill A. Shutemov
@ 2014-08-06 18:48 ` Cyrill Gorcunov
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Cyrill Gorcunov @ 2014-08-06 18:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Alexander Viro, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On Tue, Aug 05, 2014 at 09:46:28PM +0200, Oleg Nesterov wrote:
> Changes:
> 
> 	1-4: Update the changelogs, join "introduce the stable
> 	     proc_maps_private->mm" and "change m_start() to rely on
> 	     priv->mm" into a single "shift "priv->task = NULL" from
> 	     m_start() to m_stop()".
> 
> 	     Resulting code is the same. Kirill and Cyrill, you seem
> 	     to agree with these changes, I'll appreciate your acks.
> 
> 	5-7: New. Seems to work, please review.
> 
> Todo:
> 
> 	- Cleanup the tail_vma horror in m_start
> 
> 	- Update task_nommu.c in the same way
> 
> 	- Fix lock_trace() users
> 
> Oleg.

(Sorry for delay) Thanks a lot, Oleg!
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
to whole series

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map()
  2014-08-06 14:10     ` Oleg Nesterov
@ 2014-08-06 19:41       ` Oleg Nesterov
  0 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-06 19:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On 08/06, Oleg Nesterov wrote:
>
> On 08/06, Kirill A. Shutemov wrote:
> >
> > On Tue, Aug 05, 2014 at 09:46:44PM +0200, Oleg Nesterov wrote:
> > >
> > >  	if (m->count < m->size)  /* vma is copied successfully */
> > > -		m->version = (vma != get_gate_vma(task->mm))
> > > +		m->version = (vma != priv->tail_vma)
> > >  			? vma->vm_start : 0;
> >
> > Drop excessive parenthesis while you're there? And in the next hunk.
>
> I agree. But lets do this in another patch? I forgot to mention this in
> the yesterday's TODO, but we should factor out this code. Perhaps move
> this into vma_stop(). Or at least add a trivial helper, this pattern
> repeats 3 times, plus another one in m_start(). This problem is that I
> do not really understand this logic right now, but I'll do something
> today in any case.

Damn, this needs even more cleanups than I thought. Even "version == -1"
logic is absolutely wrong/dead. And "last_addr rather than next_addr" is
wrong too afaics, this just means we need the extra ->vm_next step plus
vma-or-tail_vma recheck.

Hopefully I'll finish this tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (8 preceding siblings ...)
  2014-08-06 18:48 ` Cyrill Gorcunov
@ 2014-08-07 19:17 ` Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 1/5] fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic Oleg Nesterov
                     ` (4 more replies)
  2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
                   ` (2 subsequent siblings)
  12 siblings, 5 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-07 19:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

On 08/05, Oleg Nesterov wrote:
>
> Todo:
>
> 	- Cleanup the tail_vma horror in m_start

On top of the previous "[PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups"
series.

It took me much more time than I expected ;) And after 3 attempts I failed
to fix this (imho) horror step-by-step. So 1/5 simply removes the last_addr
code, and then 4/5 brings it back.

Todo:

	- Update task_nommu.c in the same way

	- Fix lock_trace() users

Oleg.

 fs/proc/task_mmu.c |   98 +++++++++++++++++++++------------------------------
 1 files changed, 40 insertions(+), 58 deletions(-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/5] fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
@ 2014-08-07 19:17   ` Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 2/5] fs/proc/task_mmu.c: simplify m_start() to make it readable Oleg Nesterov
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-07 19:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

m_start() carefully documents, checks, and sets "m->version = -1" if
we are going to return NULL. The only problem is that we will be never
called again if m_start() returns NULL, so this is simply pointless
and misleading.

Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is
just wrong, we want -1 in this case. And in fact we also want -1 if
->vm_next == NULL and ->tail_vma == NULL.

And it is not used consistently, the "scan vmas" loop in m_start()
should update last_addr too.

Finally, imo the whole "last_addr" logic in m_start() looks horrible.
find_vma(last_addr) is called unconditionally even if we are not going
to use the result. But the main problem is that this code participates
in tail_vma-or-NULL mess, and this looks simply unfixable.

Remove this optimization. We will add it back after some cleanups.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   35 +----------------------------------
 1 files changed, 1 insertions(+), 34 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5f7fb45..d69f31c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -140,20 +140,10 @@ static void vma_stop(struct proc_maps_private *priv)
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
 	struct proc_maps_private *priv = m->private;
-	unsigned long last_addr = m->version;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma, *tail_vma = NULL;
 	loff_t l = *pos;
 
-	/*
-	 * We remember last_addr rather than next_addr to hit with
-	 * vmacache most of the time. We have zero last_addr at
-	 * the beginning and also after lseek. We will have -1 last_addr
-	 * after the end of the vmas.
-	 */
-	if (last_addr == -1UL)
-		return NULL;
-
 	priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
 	if (!priv->task)
 		return ERR_PTR(-ESRCH);
@@ -166,12 +156,6 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	tail_vma = get_gate_vma(mm);
 	priv->tail_vma = tail_vma;
 	hold_task_mempolicy(priv);
-	/* Start with last addr hint */
-	vma = find_vma(mm, last_addr);
-	if (last_addr && vma) {
-		vma = vma->vm_next;
-		goto out;
-	}
 
 	/*
 	 * Check the vma index is within the range and do
@@ -192,8 +176,6 @@ out:
 	if (vma)
 		return vma;
 
-	/* End of vmas has been reached */
-	m->version = (tail_vma != NULL)? 0: -1UL;
 	if (tail_vma)
 		return tail_vma;
 
@@ -365,14 +347,7 @@ done:
 
 static int show_map(struct seq_file *m, void *v, int is_pid)
 {
-	struct vm_area_struct *vma = v;
-	struct proc_maps_private *priv = m->private;
-
-	show_map_vma(m, vma, is_pid);
-
-	if (m->count < m->size)  /* vma is copied successfully */
-		m->version = (vma != priv->tail_vma)
-			? vma->vm_start : 0;
+	show_map_vma(m, v, is_pid);
 	return 0;
 }
 
@@ -598,7 +573,6 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
-	struct proc_maps_private *priv = m->private;
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
@@ -651,10 +625,6 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 				mss.nonlinear >> 10);
 
 	show_smap_vma_flags(m, vma);
-
-	if (m->count < m->size)  /* vma is copied successfully */
-		m->version = (vma != priv->tail_vma)
-			? vma->vm_start : 0;
 	return 0;
 }
 
@@ -1485,9 +1455,6 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 			seq_printf(m, " N%d=%lu", nid, md->node[nid]);
 out:
 	seq_putc(m, '\n');
-
-	if (m->count < m->size)
-		m->version = (vma != proc_priv->tail_vma) ? vma->vm_start : 0;
 	return 0;
 }
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/5] fs/proc/task_mmu.c: simplify m_start() to make it readable
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 1/5] fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic Oleg Nesterov
@ 2014-08-07 19:17   ` Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 3/5] fs/proc/task_mmu.c: introduce m_next_vma() helper Oleg Nesterov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-07 19:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Now that m->version is gone we can cleanup m_start(). In particular,

  - Remove the "unsigned long" typecast, m->index can't be negative
    or exceed ->map_count. But lets use "unsigned int pos" to make
    it clear that "pos < map_count" is safe.

  - Remove the unnecessary "vma != NULL" check in the main loop. It
    can't be NULL unless we have a vm bug.

  - This also means that "pos < map_count" case can simply return the
    valid vma and avoid "goto" and subsequent checks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   34 ++++++++++------------------------
 1 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d69f31c..1349d3d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -137,12 +137,12 @@ static void vma_stop(struct proc_maps_private *priv)
 	mmput(mm);
 }
 
-static void *m_start(struct seq_file *m, loff_t *pos)
+static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
 	struct mm_struct *mm;
-	struct vm_area_struct *vma, *tail_vma = NULL;
-	loff_t l = *pos;
+	struct vm_area_struct *vma;
+	unsigned int pos = *ppos;
 
 	priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
 	if (!priv->task)
@@ -151,33 +151,19 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	mm = priv->mm;
 	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
 		return NULL;
-	down_read(&mm->mmap_sem);
 
-	tail_vma = get_gate_vma(mm);
-	priv->tail_vma = tail_vma;
+	down_read(&mm->mmap_sem);
 	hold_task_mempolicy(priv);
+	priv->tail_vma = get_gate_vma(mm);
 
-	/*
-	 * Check the vma index is within the range and do
-	 * sequential scan until m_index.
-	 */
-	vma = NULL;
-	if ((unsigned long)l < mm->map_count) {
-		vma = mm->mmap;
-		while (l-- && vma)
+	if (pos < mm->map_count) {
+		for (vma = mm->mmap; pos; pos--)
 			vma = vma->vm_next;
-		goto out;
-	}
-
-	if (l != mm->map_count)
-		tail_vma = NULL; /* After gate vma */
-
-out:
-	if (vma)
 		return vma;
+	}
 
-	if (tail_vma)
-		return tail_vma;
+	if (pos == mm->map_count && priv->tail_vma)
+		return priv->tail_vma;
 
 	vma_stop(priv);
 	return NULL;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/5] fs/proc/task_mmu.c: introduce m_next_vma() helper
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 1/5] fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 2/5] fs/proc/task_mmu.c: simplify m_start() to make it readable Oleg Nesterov
@ 2014-08-07 19:17   ` Oleg Nesterov
  2014-08-07 19:17   ` [PATCH 4/5] fs/proc/task_mmu.c: reintroduce m->version logic Oleg Nesterov
  2014-08-07 19:18   ` [PATCH 5/5] fs/proc/task_mmu.c: update m->version in the main loop in m_start() Oleg Nesterov
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-07 19:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Extract the tail_vma/vm_next calculation from m_next() into the new
trivial helper, m_next_vma().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1349d3d..2ba3b16 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -137,6 +137,14 @@ static void vma_stop(struct proc_maps_private *priv)
 	mmput(mm);
 }
 
+static struct vm_area_struct *
+m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
+{
+	if (vma == priv->tail_vma)
+		return NULL;
+	return vma->vm_next ?: priv->tail_vma;
+}
+
 static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
@@ -172,13 +180,10 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 static void *m_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct proc_maps_private *priv = m->private;
-	struct vm_area_struct *tail_vma = priv->tail_vma;
-	struct vm_area_struct *vma = v, *next = NULL;
+	struct vm_area_struct *next;
 
 	(*pos)++;
-	if (vma != tail_vma)
-		next = vma->vm_next ?: tail_vma;
-
+	next = m_next_vma(priv, v);
 	if (!next)
 		vma_stop(priv);
 	return next;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/5] fs/proc/task_mmu.c: reintroduce m->version logic
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-08-07 19:17   ` [PATCH 3/5] fs/proc/task_mmu.c: introduce m_next_vma() helper Oleg Nesterov
@ 2014-08-07 19:17   ` Oleg Nesterov
  2014-08-07 19:18   ` [PATCH 5/5] fs/proc/task_mmu.c: update m->version in the main loop in m_start() Oleg Nesterov
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-07 19:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Add the "last_addr" optimization back. Like before, every ->show()
method checks !seq_overflow() and sets m->version = vma->vm_start.

However, it also checks that m_next_vma(vma) != NULL, otherwise it
sets m->version = -1 for the lockless "EOF" fast-path in m_start().

m_start() can simply do find_vma() + m_next_vma() if last_addr is
not zero, the code looks clear and simple and this case is clearly
separated from "scan vmas" path.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ba3b16..1b897dc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -145,13 +145,24 @@ m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
 	return vma->vm_next ?: priv->tail_vma;
 }
 
+static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma)
+{
+	if (m->count < m->size)	/* vma is copied successfully */
+		m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL;
+}
+
 static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
+	unsigned long last_addr = m->version;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	unsigned int pos = *ppos;
 
+	/* See m_cache_vma(). Zero at the start or after lseek. */
+	if (last_addr == -1UL)
+		return NULL;
+
 	priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
 	if (!priv->task)
 		return ERR_PTR(-ESRCH);
@@ -164,6 +175,13 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 	hold_task_mempolicy(priv);
 	priv->tail_vma = get_gate_vma(mm);
 
+	if (last_addr) {
+		vma = find_vma(mm, last_addr);
+		if (vma && (vma = m_next_vma(priv, vma)))
+			return vma;
+	}
+
+	m->version = 0;
 	if (pos < mm->map_count) {
 		for (vma = mm->mmap; pos; pos--)
 			vma = vma->vm_next;
@@ -339,6 +357,7 @@ done:
 static int show_map(struct seq_file *m, void *v, int is_pid)
 {
 	show_map_vma(m, v, is_pid);
+	m_cache_vma(m, v);
 	return 0;
 }
 
@@ -616,6 +635,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 				mss.nonlinear >> 10);
 
 	show_smap_vma_flags(m, vma);
+	m_cache_vma(m, vma);
 	return 0;
 }
 
@@ -1446,6 +1466,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 			seq_printf(m, " N%d=%lu", nid, md->node[nid]);
 out:
 	seq_putc(m, '\n');
+	m_cache_vma(m, vma);
 	return 0;
 }
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 5/5] fs/proc/task_mmu.c: update m->version in the main loop in m_start()
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
                     ` (3 preceding siblings ...)
  2014-08-07 19:17   ` [PATCH 4/5] fs/proc/task_mmu.c: reintroduce m->version logic Oleg Nesterov
@ 2014-08-07 19:18   ` Oleg Nesterov
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-07 19:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Change the main loop in m_start() to update m->version. Mostly for
consistency, but this can help to avoid the same loop if the very
1st ->show() fails due to seq_overflow().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1b897dc..4ec6b72 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -183,11 +183,14 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 
 	m->version = 0;
 	if (pos < mm->map_count) {
-		for (vma = mm->mmap; pos; pos--)
+		for (vma = mm->mmap; pos; pos--) {
+			m->version = vma->vm_start;
 			vma = vma->vm_next;
+		}
 		return vma;
 	}
 
+	/* we do not bother to update m->version in this case */
 	if (pos == mm->map_count && priv->tail_vma)
 		return priv->tail_vma;
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (9 preceding siblings ...)
  2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
@ 2014-08-11 17:00 ` Oleg Nesterov
  2014-08-11 17:00   ` [PATCH 1/3] fs/proc/task_nommu.c: change maps_open() to use __seq_open_private() Oleg Nesterov
                     ` (3 more replies)
  2014-08-20 16:49 ` [PATCH -mm 0/2] make /proc/PID/*maps* namespace friendly Oleg Nesterov
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
  12 siblings, 4 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-11 17:00 UTC (permalink / raw)
  To: Andrew Morton, Greg Ungerer
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Greg, could you review? The changes are simple, but I am not familiar
with NOMMU and I can't test this.

Depends on

	[PATCH v2 3/7] proc: introduce proc_mem_open()
	http://marc.info/?l=linux-kernel&m=140726831328943

	[PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
	http://marc.info/?l=linux-kernel&m=140726828328929
	(only because it adds proc_maps_private->mm)

I'll also send you mbox with the previous series just in case.

Oleg.

 fs/proc/task_nommu.c |   65 ++++++++++++++++++++++++++++++--------------------
 1 files changed, 39 insertions(+), 26 deletions(-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/3] fs/proc/task_nommu.c: change maps_open() to use __seq_open_private()
  2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
@ 2014-08-11 17:00   ` Oleg Nesterov
  2014-08-11 17:00   ` [PATCH 2/3] fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-11 17:00 UTC (permalink / raw)
  To: Andrew Morton, Greg Ungerer
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Cleanup and preparation. maps_open() can use __seq_open_private()
like proc_maps_open() does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_nommu.c |   22 +++++++---------------
 1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 678455d..c12eab3 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -268,21 +268,13 @@ static const struct seq_operations proc_tid_maps_ops = {
 static int maps_open(struct inode *inode, struct file *file,
 		     const struct seq_operations *ops)
 {
-	struct proc_maps_private *priv;
-	int ret = -ENOMEM;
-
-	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
-	if (priv) {
-		priv->pid = proc_pid(inode);
-		ret = seq_open(file, ops);
-		if (!ret) {
-			struct seq_file *m = file->private_data;
-			m->private = priv;
-		} else {
-			kfree(priv);
-		}
-	}
-	return ret;
+	struct proc_maps_private *priv = __seq_open_private(file, ops,
+					 sizeof(struct proc_maps_private));
+	if (!priv)
+		return -ENOMEM;
+
+	priv->pid = proc_pid(inode);
+	return 0;
 }
 
 static int pid_maps_open(struct inode *inode, struct file *file)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/3] fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
  2014-08-11 17:00   ` [PATCH 1/3] fs/proc/task_nommu.c: change maps_open() to use __seq_open_private() Oleg Nesterov
@ 2014-08-11 17:00   ` Oleg Nesterov
  2014-08-11 17:00   ` [PATCH 3/3] fs/proc/task_nommu.c: don't use priv->task->mm Oleg Nesterov
  2014-08-12 13:57   ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Greg Ungerer
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-11 17:00 UTC (permalink / raw)
  To: Andrew Morton, Greg Ungerer
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
from m_start() to proc_maps_open()" into task_nommu.c.

Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_nommu.c |   29 ++++++++++++++++++++++++-----
 1 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index c12eab3..003f2be 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -216,11 +216,11 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	if (!priv->task)
 		return ERR_PTR(-ESRCH);
 
-	mm = mm_access(priv->task, PTRACE_MODE_READ);
-	if (!mm || IS_ERR(mm)) {
+	mm = priv->mm;
+	if (!mm || !atomic_inc_not_zero(&mm->mm_users)) {
 		put_task_struct(priv->task);
 		priv->task = NULL;
-		return mm;
+		return NULL;
 	}
 	down_read(&mm->mmap_sem);
 
@@ -274,9 +274,28 @@ static int maps_open(struct inode *inode, struct file *file,
 		return -ENOMEM;
 
 	priv->pid = proc_pid(inode);
+	priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(priv->mm)) {
+		int err = PTR_ERR(priv->mm);
+		seq_release_private(inode, file);
+		return err;
+	}
+
 	return 0;
 }
 
+
+static int map_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct proc_maps_private *priv = seq->private;
+
+	if (priv->mm)
+		mmdrop(priv->mm);
+
+	return seq_release_private(inode, file);
+}
+
 static int pid_maps_open(struct inode *inode, struct file *file)
 {
 	return maps_open(inode, file, &proc_pid_maps_ops);
@@ -291,13 +310,13 @@ const struct file_operations proc_pid_maps_operations = {
 	.open		= pid_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= map_release,
 };
 
 const struct file_operations proc_tid_maps_operations = {
 	.open		= tid_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= seq_release_private,
+	.release	= map_release,
 };
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/3] fs/proc/task_nommu.c: don't use priv->task->mm
  2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
  2014-08-11 17:00   ` [PATCH 1/3] fs/proc/task_nommu.c: change maps_open() to use __seq_open_private() Oleg Nesterov
  2014-08-11 17:00   ` [PATCH 2/3] fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
@ 2014-08-11 17:00   ` Oleg Nesterov
  2014-08-12 13:57   ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Greg Ungerer
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-11 17:00 UTC (permalink / raw)
  To: Andrew Morton, Greg Ungerer
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
but the usage of task->mm in m_stop(). The task can exit/exec before
we take mmap_sem, in this case m_stop() can hit NULL or unlock the
wrong rw_semaphore.

Also, this code uses priv->task != NULL to decide whether we need
up_read/mmput. This is correct, but we will probably kill priv->task.
Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_nommu.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 003f2be..e0237c1 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -217,17 +217,17 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		return ERR_PTR(-ESRCH);
 
 	mm = priv->mm;
-	if (!mm || !atomic_inc_not_zero(&mm->mm_users)) {
-		put_task_struct(priv->task);
-		priv->task = NULL;
+	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
 		return NULL;
-	}
-	down_read(&mm->mmap_sem);
 
+	down_read(&mm->mmap_sem);
 	/* start from the Nth VMA */
 	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
 		if (n-- == 0)
 			return p;
+
+	up_read(&mm->mmap_sem);
+	mmput(mm);
 	return NULL;
 }
 
@@ -235,11 +235,13 @@ static void m_stop(struct seq_file *m, void *_vml)
 {
 	struct proc_maps_private *priv = m->private;
 
+	if (!IS_ERR_OR_NULL(_vml)) {
+		up_read(&priv->mm->mmap_sem);
+		mmput(priv->mm);
+	}
 	if (priv->task) {
-		struct mm_struct *mm = priv->task->mm;
-		up_read(&mm->mmap_sem);
-		mmput(mm);
 		put_task_struct(priv->task);
+		priv->task = NULL;
 	}
 }
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c
  2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-08-11 17:00   ` [PATCH 3/3] fs/proc/task_nommu.c: don't use priv->task->mm Oleg Nesterov
@ 2014-08-12 13:57   ` Greg Ungerer
  2014-08-12 17:35     ` Oleg Nesterov
  3 siblings, 1 reply; 48+ messages in thread
From: Greg Ungerer @ 2014-08-12 13:57 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, David Howells, David S. Miller,
	Eric W. Biederman, Kirill A. Shutemov, Kirill A. Shutemov,
	Peter Zijlstra, Sasha Levin, linux-fsdevel, linux-kernel

Hi Oleg,

On 12/08/14 03:00, Oleg Nesterov wrote:
> Greg, could you review? The changes are simple, but I am not familiar
> with NOMMU and I can't test this.
>
> Depends on
>
> 	[PATCH v2 3/7] proc: introduce proc_mem_open()
> 	http://marc.info/?l=linux-kernel&m=140726831328943
>
> 	[PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
> 	http://marc.info/?l=linux-kernel&m=140726828328929
> 	(only because it adds proc_maps_private->mm)
>
> I'll also send you mbox with the previous series just in case.
>
> Oleg.
>
>   fs/proc/task_nommu.c |   65 ++++++++++++++++++++++++++++++--------------------
>   1 files changed, 39 insertions(+), 26 deletions(-)

I don't see any problems. Applied cleanly for me (to a 3.16-rc7 tree), 
compiled
cleanly for a non-mmu m68k target, and ran with no problems I could see.
At least I checked /proc/1/maps and that still came out fine. Is there 
anything
else I should check?

I am happy to ack it:

Acked-by: Greg Ungerer <gerg@uclinux.org>

Regards
Greg




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c
  2014-08-12 13:57   ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Greg Ungerer
@ 2014-08-12 17:35     ` Oleg Nesterov
  0 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-12 17:35 UTC (permalink / raw)
  To: Greg Ungerer
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	David S. Miller, Eric W. Biederman, Kirill A. Shutemov,
	Kirill A. Shutemov, Peter Zijlstra, Sasha Levin, linux-fsdevel,
	linux-kernel

On 08/12, Greg Ungerer wrote:
>
> Hi Oleg,
>
> On 12/08/14 03:00, Oleg Nesterov wrote:
>> Greg, could you review? The changes are simple, but I am not familiar
>> with NOMMU and I can't test this.
>>
>> Depends on
>>
>> 	[PATCH v2 3/7] proc: introduce proc_mem_open()
>> 	http://marc.info/?l=linux-kernel&m=140726831328943
>>
>> 	[PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
>> 	http://marc.info/?l=linux-kernel&m=140726828328929
>> 	(only because it adds proc_maps_private->mm)
>>
>> I'll also send you mbox with the previous series just in case.
>>
>> Oleg.
>>
>>   fs/proc/task_nommu.c |   65 ++++++++++++++++++++++++++++++--------------------
>>   1 files changed, 39 insertions(+), 26 deletions(-)
>
> I don't see any problems. Applied cleanly for me (to a 3.16-rc7 tree),
> compiled
> cleanly for a non-mmu m68k target, and ran with no problems I could see.
> At least I checked /proc/1/maps and that still came out fine. Is there
> anything
> else I should check?

Great, hopefully nothing else.

> I am happy to ack it:
>
> Acked-by: Greg Ungerer <gerg@uclinux.org>

Thanks a lot!

I'll send more cleanups once/if Andrew takes the pending patches.

Oleg.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH -mm 0/2] make /proc/PID/*maps* namespace friendly
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (10 preceding siblings ...)
  2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
@ 2014-08-20 16:49 ` Oleg Nesterov
  2014-08-20 16:49   ` [PATCH -mm 1/2] proc/maps: replace proc_maps_private->pid with "struct inode *inode" Oleg Nesterov
  2014-08-20 16:50   ` [PATCH -mm 2/2] proc/maps: make vm_is_stack() logic namespace-friendly Oleg Nesterov
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
  12 siblings, 2 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman, Greg Ungerer,
	Kirill A. Shutemov, linux-kernel

Hello,

Lets continue the boring proc/maps cleanups.

Eric, I do agree that the tgid vs pid logic looks strange, but I didn't
even try to change it. Let's at least fix the namespace problems first.
This is minor, but it is shame that /proc/pid/maps still shows the pid
from the global namespace. And this is preparation for the next change.

With this change the only user of proc_maps_private->task is
hold_task_mempolicy/release_task_mempolicy. I believe this code is buggy
and should die, but this needs the changes in mm/mempolicy.c.

After that we can kill priv->task. task_nommu.c can be updated right now,
but it would be better to change it along with task_mmu.c (we have a small
complication with -ESRCH).

Oleg.

 fs/proc/internal.h   |    2 +-
 fs/proc/task_mmu.c   |   29 +++++++++++++++++++++++------
 fs/proc/task_nommu.c |   25 ++++++++++++++++++++++---
 include/linux/mm.h   |    4 ++--
 mm/util.c            |   23 ++++++++---------------
 5 files changed, 56 insertions(+), 27 deletions(-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH -mm 1/2] proc/maps: replace proc_maps_private->pid with "struct inode *inode"
  2014-08-20 16:49 ` [PATCH -mm 0/2] make /proc/PID/*maps* namespace friendly Oleg Nesterov
@ 2014-08-20 16:49   ` Oleg Nesterov
  2014-08-20 16:50   ` [PATCH -mm 2/2] proc/maps: make vm_is_stack() logic namespace-friendly Oleg Nesterov
  1 sibling, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman, Greg Ungerer,
	Kirill A. Shutemov, linux-kernel

m_start() can use get_proc_task() instead, and "struct inode *"
provides more potentially useful info, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/internal.h   |    2 +-
 fs/proc/task_mmu.c   |    4 ++--
 fs/proc/task_nommu.c |    4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index d271828..aa7a0ee 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -268,7 +268,7 @@ extern int proc_remount(struct super_block *, int *, char *);
  * task_[no]mmu.c
  */
 struct proc_maps_private {
-	struct pid *pid;
+	struct inode *inode;
 	struct task_struct *task;
 	struct mm_struct *mm;
 #ifdef CONFIG_MMU
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index aceff2d..828ff49 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -164,7 +164,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 	if (last_addr == -1UL)
 		return NULL;
 
-	priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
+	priv->task = get_proc_task(priv->inode);
 	if (!priv->task)
 		return ERR_PTR(-ESRCH);
 
@@ -231,7 +231,7 @@ static int proc_maps_open(struct inode *inode, struct file *file,
 	if (!priv)
 		return -ENOMEM;
 
-	priv->pid = proc_pid(inode);
+	priv->inode = inode;
 	priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
 	if (IS_ERR(priv->mm)) {
 		int err = PTR_ERR(priv->mm);
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 8f157e4..f1cc304 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -212,7 +212,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	loff_t n = *pos;
 
 	/* pin the task and mm whilst we play with them */
-	priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
+	priv->task = get_proc_task(priv->inode);
 	if (!priv->task)
 		return ERR_PTR(-ESRCH);
 
@@ -275,7 +275,7 @@ static int maps_open(struct inode *inode, struct file *file,
 	if (!priv)
 		return -ENOMEM;
 
-	priv->pid = proc_pid(inode);
+	priv->inode = inode;
 	priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
 	if (IS_ERR(priv->mm)) {
 		int err = PTR_ERR(priv->mm);
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH -mm 2/2] proc/maps: make vm_is_stack() logic namespace-friendly
  2014-08-20 16:49 ` [PATCH -mm 0/2] make /proc/PID/*maps* namespace friendly Oleg Nesterov
  2014-08-20 16:49   ` [PATCH -mm 1/2] proc/maps: replace proc_maps_private->pid with "struct inode *inode" Oleg Nesterov
@ 2014-08-20 16:50   ` Oleg Nesterov
  1 sibling, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 16:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman, Greg Ungerer,
	Kirill A. Shutemov, linux-kernel

- Rename vm_is_stack() to task_of_stack() and change it to return
  "struct task_struct *" rather than the global (and thus wrong in
  general) pid_t.

- Add the new pid_of_stack() helper which calls task_of_stack() and
  uses the right namespace to report the correct pid_t.

  Unfortunately we need to define this helper twice, in task_mmu.c
  and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
  and move at least pid_of_stack/task_of_stack there to avoid the
  code duplication.

- Change show_map_vma() and show_numa_map() to use the new helper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c   |   25 +++++++++++++++++++++----
 fs/proc/task_nommu.c |   21 ++++++++++++++++++++-
 include/linux/mm.h   |    4 ++--
 mm/util.c            |   23 ++++++++---------------
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 828ff49..9d7a9a6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -262,13 +262,31 @@ static int do_maps_open(struct inode *inode, struct file *file,
 				sizeof(struct proc_maps_private));
 }
 
+static pid_t pid_of_stack(struct proc_maps_private *priv,
+				struct vm_area_struct *vma, bool is_pid)
+{
+	struct inode *inode = priv->inode;
+	struct task_struct *task;
+	pid_t ret = 0;
+
+	rcu_read_lock();
+	task = pid_task(proc_pid(inode), PIDTYPE_PID);
+	if (task) {
+		task = task_of_stack(task, vma, is_pid);
+		if (task)
+			ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info);
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static void
 show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct file *file = vma->vm_file;
 	struct proc_maps_private *priv = m->private;
-	struct task_struct *task = priv->task;
 	vm_flags_t flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
@@ -333,8 +351,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 			goto done;
 		}
 
-		tid = vm_is_stack(task, vma, is_pid);
-
+		tid = pid_of_stack(priv, vma, is_pid);
 		if (tid != 0) {
 			/*
 			 * Thread stack in /proc/PID/task/TID/maps or
@@ -1438,7 +1455,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
 		seq_puts(m, " heap");
 	} else {
-		pid_t tid = vm_is_stack(task, vma, is_pid);
+		pid_t tid = pid_of_stack(proc_priv, vma, is_pid);
 		if (tid != 0) {
 			/*
 			 * Thread stack in /proc/PID/task/TID/maps or
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index f1cc304..0283e5c 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -123,6 +123,25 @@ unsigned long task_statm(struct mm_struct *mm,
 	return size;
 }
 
+static pid_t pid_of_stack(struct proc_maps_private *priv,
+				struct vm_area_struct *vma, bool is_pid)
+{
+	struct inode *inode = priv->inode;
+	struct task_struct *task;
+	pid_t ret = 0;
+
+	rcu_read_lock();
+	task = pid_task(proc_pid(inode), PIDTYPE_PID);
+	if (task) {
+		task = task_of_stack(task, vma, is_pid);
+		if (task)
+			ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info);
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * display a single VMA to a sequenced file
  */
@@ -163,7 +182,7 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma,
 		seq_pad(m, ' ');
 		seq_path(m, &file->f_path, "");
 	} else if (mm) {
-		pid_t tid = vm_is_stack(priv->task, vma, is_pid);
+		pid_t tid = pid_of_stack(priv, vma, is_pid);
 
 		if (tid != 0) {
 			seq_pad(m, ' ');
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..1c05bd8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1247,8 +1247,8 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma,
 		!vma_growsup(vma->vm_next, addr);
 }
 
-extern pid_t
-vm_is_stack(struct task_struct *task, struct vm_area_struct *vma, int in_group);
+extern struct task_struct *task_of_stack(struct task_struct *task,
+				struct vm_area_struct *vma, bool in_group);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
diff --git a/mm/util.c b/mm/util.c
index 093c973..fec39d4 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -170,32 +170,25 @@ static int vm_is_stack_for_task(struct task_struct *t,
 /*
  * Check if the vma is being used as a stack.
  * If is_group is non-zero, check in the entire thread group or else
- * just check in the current task. Returns the pid of the task that
- * the vma is stack for.
+ * just check in the current task. Returns the task_struct of the task
+ * that the vma is stack for. Must be called under rcu_read_lock().
  */
-pid_t vm_is_stack(struct task_struct *task,
-		  struct vm_area_struct *vma, int in_group)
+struct task_struct *task_of_stack(struct task_struct *task,
+				struct vm_area_struct *vma, bool in_group)
 {
-	pid_t ret = 0;
-
 	if (vm_is_stack_for_task(task, vma))
-		return task->pid;
+		return task;
 
 	if (in_group) {
 		struct task_struct *t;
 
-		rcu_read_lock();
 		for_each_thread(task, t) {
-			if (vm_is_stack_for_task(t, vma)) {
-				ret = t->pid;
-				goto done;
-			}
+			if (vm_is_stack_for_task(t, vma))
+				return t;
 		}
-done:
-		rcu_read_unlock();
 	}
 
-	return ret;
+	return NULL;
 }
 
 #if defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT)
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations
  2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
                   ` (11 preceding siblings ...)
  2014-08-20 16:49 ` [PATCH -mm 0/2] make /proc/PID/*maps* namespace friendly Oleg Nesterov
@ 2014-08-20 19:22 ` Oleg Nesterov
  2014-08-20 19:22   ` [PATCH 1/4] mempolicy: change alloc_pages_vma() to use mpol_cond_put() Oleg Nesterov
                     ` (4 more replies)
  12 siblings, 5 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 19:22 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

On 08/20, Oleg Nesterov wrote:
>
> With this change the only user of proc_maps_private->task is
> hold_task_mempolicy/release_task_mempolicy. I believe this code is buggy
> and should die, but this needs the changes in mm/mempolicy.c.

I belive that 9e7814404b77c3e8 "hold task->mempolicy while numa_maps scans"
was wrong and it doesn't fix the problem.

I'll try to send the fix tomorrow, could you review the initial preparations?
To me, these changes make sense as cleanups in any case, but I don't really
understand this code and this was only compile tested.

Oleg.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/4] mempolicy: change alloc_pages_vma() to use mpol_cond_put()
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
@ 2014-08-20 19:22   ` Oleg Nesterov
  2014-08-20 19:23   ` [PATCH 2/4] mempolicy: change get_task_policy() to return default_policy rather than NULL Oleg Nesterov
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 19:22 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

Trivial cleanup. alloc_pages_vma() can use mpol_cond_put().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempolicy.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..c0c5d38 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2046,8 +2046,7 @@ retry_cpuset:
 	page = __alloc_pages_nodemask(gfp, order,
 				      policy_zonelist(gfp, pol, node),
 				      policy_nodemask(gfp, pol));
-	if (unlikely(mpol_needs_cond_ref(pol)))
-		__mpol_put(pol);
+	mpol_cond_put(pol);
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 	return page;
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/4] mempolicy: change get_task_policy() to return default_policy rather than NULL
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
  2014-08-20 19:22   ` [PATCH 1/4] mempolicy: change alloc_pages_vma() to use mpol_cond_put() Oleg Nesterov
@ 2014-08-20 19:23   ` Oleg Nesterov
  2014-08-20 19:23   ` [PATCH 3/4] mempolicy: sanitize the usage of get_task_policy() Oleg Nesterov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 19:23 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

Every caller of get_task_policy() falls back to default_policy if it
returns NULL. Change get_task_policy() to do this.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempolicy.c |   31 +++++++++++++------------------
 1 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c0c5d38..656db97 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -126,22 +126,20 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 static struct mempolicy *get_task_policy(struct task_struct *p)
 {
 	struct mempolicy *pol = p->mempolicy;
+	int node;
 
-	if (!pol) {
-		int node = numa_node_id();
+	if (pol)
+		return pol;
 
-		if (node != NUMA_NO_NODE) {
-			pol = &preferred_node_policy[node];
-			/*
-			 * preferred_node_policy is not initialised early in
-			 * boot
-			 */
-			if (!pol->mode)
-				pol = NULL;
-		}
+	node = numa_node_id();
+	if (node != NUMA_NO_NODE) {
+		pol = &preferred_node_policy[node];
+		/* preferred_node_policy is not initialised early in boot */
+		if (pol->mode)
+			return pol;
 	}
 
-	return pol;
+	return &default_policy;
 }
 
 static const struct mempolicy_operations {
@@ -1644,14 +1642,14 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 				mpol_get(pol);
 		}
 	}
-	if (!pol)
-		pol = &default_policy;
+
 	return pol;
 }
 
 bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
 {
 	struct mempolicy *pol = get_task_policy(task);
+
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
 			bool ret = false;
@@ -1667,9 +1665,6 @@ bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
 		}
 	}
 
-	if (!pol)
-		return default_policy.flags & MPOL_F_MOF;
-
 	return pol->flags & MPOL_F_MOF;
 }
 
@@ -2077,7 +2072,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
-	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
+	if (in_interrupt() || (gfp & __GFP_THISNODE))
 		pol = &default_policy;
 
 retry_cpuset:
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/4] mempolicy: sanitize the usage of get_task_policy()
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
  2014-08-20 19:22   ` [PATCH 1/4] mempolicy: change alloc_pages_vma() to use mpol_cond_put() Oleg Nesterov
  2014-08-20 19:23   ` [PATCH 2/4] mempolicy: change get_task_policy() to return default_policy rather than NULL Oleg Nesterov
@ 2014-08-20 19:23   ` Oleg Nesterov
  2014-08-20 19:23   ` [PATCH 4/4] mempolicy: remove the "task" arg of vma_policy_mof() and simplify it Oleg Nesterov
  2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 19:23 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

Cleanup + preparation. Every user of get_task_policy() calls it
unconditionally, even if it is not going to use the result.

get_task_policy() is cheap but still this does not look clean, plus
the code looks simpler if get_task_policy() is called only when this
is really needed.

Note: I hope this is correct, but it is not clear why vma_policy_mof()
doesn't fall back to get_task_policy() if ->get_policy() returns NULL.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempolicy.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 656db97..b86b08e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1621,14 +1621,11 @@ COMPAT_SYSCALL_DEFINE6(mbind, compat_ulong_t, start, compat_ulong_t, len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = get_task_policy(task);
+	struct mempolicy *pol = NULL;
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
-			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
-									addr);
-			if (vpol)
-				pol = vpol;
+			pol = vma->vm_ops->get_policy(vma, addr);
 		} else if (vma->vm_policy) {
 			pol = vma->vm_policy;
 
@@ -1643,12 +1640,15 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 		}
 	}
 
+	if (!pol)
+		pol = get_task_policy(task);
+
 	return pol;
 }
 
 bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
 {
-	struct mempolicy *pol = get_task_policy(task);
+	struct mempolicy *pol = NULL;
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -1660,11 +1660,14 @@ bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
 			mpol_cond_put(pol);
 
 			return ret;
-		} else if (vma->vm_policy) {
-			pol = vma->vm_policy;
 		}
+
+		pol = vma->vm_policy;
 	}
 
+	if (!pol)
+		pol = get_task_policy(task);
+
 	return pol->flags & MPOL_F_MOF;
 }
 
@@ -2068,12 +2071,12 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = get_task_policy(current);
+	struct mempolicy *pol = &default_policy;
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
-	if (in_interrupt() || (gfp & __GFP_THISNODE))
-		pol = &default_policy;
+	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
+		pol = get_task_policy(current);
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/4] mempolicy: remove the "task" arg of vma_policy_mof() and simplify it
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
                     ` (2 preceding siblings ...)
  2014-08-20 19:23   ` [PATCH 3/4] mempolicy: sanitize the usage of get_task_policy() Oleg Nesterov
@ 2014-08-20 19:23   ` Oleg Nesterov
  2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
  4 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-20 19:23 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

1. vma_policy_mof(task) is simply not safe unless task == current,
   it can race with do_exit()->mpol_put(). Remove this arg and update
   its single caller.

2. vma can not be NULL, remove this check and simplify the code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/mempolicy.h |    2 +-
 kernel/sched/fair.c       |    2 +-
 mm/mempolicy.c            |   25 +++++++++++--------------
 3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f230a97..5e4bfce 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -136,7 +136,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 
 struct mempolicy *get_vma_policy(struct task_struct *tsk,
 		struct vm_area_struct *vma, unsigned long addr);
-bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma);
+bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..82088b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1946,7 +1946,7 @@ void task_numa_work(struct callback_head *work)
 		vma = mm->mmap;
 	}
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
+		if (!vma_migratable(vma) || !vma_policy_mof(vma))
 			continue;
 
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b86b08e..ad27bbc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1646,27 +1646,24 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 	return pol;
 }
 
-bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma)
+bool vma_policy_mof(struct vm_area_struct *vma)
 {
-	struct mempolicy *pol = NULL;
-
-	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy) {
-			bool ret = false;
+	struct mempolicy *pol;
 
-			pol = vma->vm_ops->get_policy(vma, vma->vm_start);
-			if (pol && (pol->flags & MPOL_F_MOF))
-				ret = true;
-			mpol_cond_put(pol);
+	if (vma->vm_ops && vma->vm_ops->get_policy) {
+		bool ret = false;
 
-			return ret;
-		}
+		pol = vma->vm_ops->get_policy(vma, vma->vm_start);
+		if (pol && (pol->flags & MPOL_F_MOF))
+			ret = true;
+		mpol_cond_put(pol);
 
-		pol = vma->vm_policy;
+		return ret;
 	}
 
+	pol = vma->vm_policy;
 	if (!pol)
-		pol = get_task_policy(task);
+		pol = get_task_policy(current);
 
 	return pol->flags & MPOL_F_MOF;
 }
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race
  2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
                     ` (3 preceding siblings ...)
  2014-08-20 19:23   ` [PATCH 4/4] mempolicy: remove the "task" arg of vma_policy_mof() and simplify it Oleg Nesterov
@ 2014-08-21 19:00   ` Oleg Nesterov
  2014-08-21 19:00     ` [PATCH 1/4] mempolicy: introduce __get_vma_policy(), export get_task_policy() Oleg Nesterov
                       ` (3 more replies)
  4 siblings, 4 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-21 19:00 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

On 08/20, Oleg Nesterov wrote:
>
> I belive that 9e7814404b77c3e8 "hold task->mempolicy while numa_maps scans"
> was wrong and it doesn't fix the problem.

See 2/4.

> I'll try to send the fix tomorrow, could you review the initial preparations?
> To me, these changes make sense as cleanups in any case, but I don't really
> understand this code and this was only compile tested.

Untested again because I do not know how, please review ;) On top of
"mempolicy: get_task_policy() cleanups/preparations" series.

I'll send the final cleanups later.

Oleg.

 fs/proc/task_mmu.c        |   33 ++++++-----------------
 include/linux/mempolicy.h |    5 ++-
 mm/mempolicy.c            |   64 +++++++++++++++++++-------------------------
 3 files changed, 40 insertions(+), 62 deletions(-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/4] mempolicy: introduce __get_vma_policy(), export get_task_policy()
  2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
@ 2014-08-21 19:00     ` Oleg Nesterov
  2014-08-21 19:01     ` [PATCH 2/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-21 19:00 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

Extract the code which looks for vma's policy from get_vma_policy()
into the new helper, __get_vma_policy(). Export get_task_policy().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/mempolicy.h |    3 +++
 mm/mempolicy.c            |   44 ++++++++++++++++++++++++++------------------
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5e4bfce..e1abe24 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -134,6 +134,9 @@ void mpol_free_shared_policy(struct shared_policy *p);
 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 					    unsigned long idx);
 
+struct mempolicy *get_task_policy(struct task_struct *p);
+struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
+		unsigned long addr);
 struct mempolicy *get_vma_policy(struct task_struct *tsk,
 		struct vm_area_struct *vma, unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ad27bbc..4378c33 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -123,7 +123,7 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
-static struct mempolicy *get_task_policy(struct task_struct *p)
+struct mempolicy *get_task_policy(struct task_struct *p)
 {
 	struct mempolicy *pol = p->mempolicy;
 	int node;
@@ -1603,23 +1603,8 @@ COMPAT_SYSCALL_DEFINE6(mbind, compat_ulong_t, start, compat_ulong_t, len,
 
 #endif
 
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task: task for fallback if vma policy == default
- * @vma: virtual memory area whose policy is sought
- * @addr: address in @vma for shared policy lookup
- *
- * Returns effective policy for a VMA at specified address.
- * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
- */
-struct mempolicy *get_vma_policy(struct task_struct *task,
-		struct vm_area_struct *vma, unsigned long addr)
+struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr)
 {
 	struct mempolicy *pol = NULL;
 
@@ -1640,6 +1625,29 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 		}
 	}
 
+	return pol;
+}
+
+/*
+ * get_vma_policy(@task, @vma, @addr)
+ * @task: task for fallback if vma policy == default
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
+ *
+ * Returns effective policy for a VMA at specified address.
+ * Falls back to @task or system default policy, as necessary.
+ * Current or other task's task mempolicy and non-shared vma policies must be
+ * protected by task_lock(task) by the caller.
+ * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
+ * count--added by the get_policy() vm_op, as appropriate--to protect against
+ * freeing by another task.  It is the caller's responsibility to free the
+ * extra reference for shared policies.
+ */
+struct mempolicy *get_vma_policy(struct task_struct *task,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = __get_vma_policy(vma, addr);
+
 	if (!pol)
 		pol = get_task_policy(task);
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race
  2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
  2014-08-21 19:00     ` [PATCH 1/4] mempolicy: introduce __get_vma_policy(), export get_task_policy() Oleg Nesterov
@ 2014-08-21 19:01     ` Oleg Nesterov
  2014-08-21 19:01     ` [PATCH 3/4] mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem) Oleg Nesterov
  2014-08-21 19:01     ` [PATCH 4/4] mempolicy: unexport get_vma_policy() and remove its "task" arg Oleg Nesterov
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-21 19:01 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

9e7814404b77 "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.

The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.

However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.

Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.

Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/task_mmu.c |   33 +++++++++------------------------
 1 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9d7a9a6..7810aba 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -87,32 +87,14 @@ unsigned long task_statm(struct mm_struct *mm,
 
 #ifdef CONFIG_NUMA
 /*
- * These functions are for numa_maps but called in generic **maps seq_file
- * ->start(), ->stop() ops.
- *
- * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
- * Each mempolicy object is controlled by reference counting. The problem here
- * is how to avoid accessing dead mempolicy object.
- *
- * Because we're holding mmap_sem while reading seq_file, it's safe to access
- * each vma's mempolicy, no vma objects will never drop refs to mempolicy.
- *
- * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy
- * is set and replaced under mmap_sem but unrefed and cleared under task_lock().
- * So, without task_lock(), we cannot trust get_vma_policy() because we cannot
- * gurantee the task never exits under us. But taking task_lock() around
- * get_vma_plicy() causes lock order problem.
- *
- * To access task->mempolicy without lock, we hold a reference count of an
- * object pointed by task->mempolicy and remember it. This will guarantee
- * that task->mempolicy points to an alive object or NULL in numa_maps accesses.
+ * Save get_task_policy() for show_numa_map().
  */
 static void hold_task_mempolicy(struct proc_maps_private *priv)
 {
 	struct task_struct *task = priv->task;
 
 	task_lock(task);
-	priv->task_mempolicy = task->mempolicy;
+	priv->task_mempolicy = get_task_policy(task);
 	mpol_get(priv->task_mempolicy);
 	task_unlock(task);
 }
@@ -1423,7 +1405,6 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	struct vm_area_struct *vma = v;
 	struct numa_maps *md = &numa_priv->md;
 	struct file *file = vma->vm_file;
-	struct task_struct *task = proc_priv->task;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_walk walk = {};
 	struct mempolicy *pol;
@@ -1443,9 +1424,13 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	walk.private = md;
 	walk.mm = mm;
 
-	pol = get_vma_policy(task, vma, vma->vm_start);
-	mpol_to_str(buffer, sizeof(buffer), pol);
-	mpol_cond_put(pol);
+	pol = __get_vma_policy(vma, vma->vm_start);
+	if (pol) {
+		mpol_to_str(buffer, sizeof(buffer), pol);
+		mpol_cond_put(pol);
+	} else {
+		mpol_to_str(buffer, sizeof(buffer), proc_priv->task_mempolicy);
+	}
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/4] mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem)
  2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
  2014-08-21 19:00     ` [PATCH 1/4] mempolicy: introduce __get_vma_policy(), export get_task_policy() Oleg Nesterov
  2014-08-21 19:01     ` [PATCH 2/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
@ 2014-08-21 19:01     ` Oleg Nesterov
  2014-08-21 19:01     ` [PATCH 4/4] mempolicy: unexport get_vma_policy() and remove its "task" arg Oleg Nesterov
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-21 19:01 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic
was never correct and it is no longer needed, see the previous patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempolicy.c |   15 +--------------
 1 files changed, 1 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4378c33..9695a9a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -802,7 +802,6 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 			     nodemask_t *nodes)
 {
 	struct mempolicy *new, *old;
-	struct mm_struct *mm = current->mm;
 	NODEMASK_SCRATCH(scratch);
 	int ret;
 
@@ -814,20 +813,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 		ret = PTR_ERR(new);
 		goto out;
 	}
-	/*
-	 * prevent changing our mempolicy while show_numa_maps()
-	 * is using it.
-	 * Note:  do_set_mempolicy() can be called at init time
-	 * with no 'mm'.
-	 */
-	if (mm)
-		down_write(&mm->mmap_sem);
+
 	task_lock(current);
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		task_unlock(current);
-		if (mm)
-			up_write(&mm->mmap_sem);
 		mpol_put(new);
 		goto out;
 	}
@@ -837,9 +827,6 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	    nodes_weight(new->v.nodes))
 		current->il_next = first_node(new->v.nodes);
 	task_unlock(current);
-	if (mm)
-		up_write(&mm->mmap_sem);
-
 	mpol_put(old);
 	ret = 0;
 out:
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/4] mempolicy: unexport get_vma_policy() and remove its "task" arg
  2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
                       ` (2 preceding siblings ...)
  2014-08-21 19:01     ` [PATCH 3/4] mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem) Oleg Nesterov
@ 2014-08-21 19:01     ` Oleg Nesterov
  3 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-08-21 19:01 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, David Rientjes, KOSAKI Motohiro
  Cc: Alexander Viro, Cyrill Gorcunov, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Hugh Dickins, linux-kernel

- get_vma_policy(task) is not safe if task != current, remove this
  argument.

- get_vma_policy() no longer has callers outside of mempolicy.c,
  make it static.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/mempolicy.h |    2 --
 mm/mempolicy.c            |   19 ++++++++-----------
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e1abe24..3d385c8 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -137,8 +137,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct task_struct *tsk,
-		struct vm_area_struct *vma, unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9695a9a..008fb32 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1616,27 +1616,24 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 }
 
 /*
- * get_vma_policy(@task, @vma, @addr)
- * @task: task for fallback if vma policy == default
+ * get_vma_policy(@vma, @addr)
  * @vma: virtual memory area whose policy is sought
  * @addr: address in @vma for shared policy lookup
  *
  * Returns effective policy for a VMA at specified address.
- * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
+ * Falls back to current->mempolicy or system default policy, as necessary.
  * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
  * count--added by the get_policy() vm_op, as appropriate--to protect against
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct task_struct *task,
-		struct vm_area_struct *vma, unsigned long addr)
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
 
 	if (!pol)
-		pol = get_task_policy(task);
+		pol = get_task_policy(current);
 
 	return pol;
 }
@@ -1864,7 +1861,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 {
 	struct zonelist *zl;
 
-	*mpol = get_vma_policy(current, vma, addr);
+	*mpol = get_vma_policy(vma, addr);
 	*nodemask = NULL;	/* assume !MPOL_BIND */
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
@@ -2019,7 +2016,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
-	pol = get_vma_policy(current, vma, addr);
+	pol = get_vma_policy(vma, addr);
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
@@ -2285,7 +2282,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	BUG_ON(!vma);
 
-	pol = get_vma_policy(current, vma, addr);
+	pol = get_vma_policy(vma, addr);
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-08-05 19:46 ` [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
  2014-08-06 10:05   ` Kirill A. Shutemov
@ 2014-12-03 14:14   ` Kirill A. Shutemov
  2014-12-03 16:59     ` Eric W. Biederman
  2014-12-03 17:34     ` Oleg Nesterov
  1 sibling, 2 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-12-03 14:14 UTC (permalink / raw)
  To: Oleg Nesterov, David S. Miller, Linus Torvalds
  Cc: Andrew Morton, Alexander Viro, Cyrill Gorcunov, David Howells,
	Eric W. Biederman, Kirill A. Shutemov, Peter Zijlstra,
	Sasha Levin, linux-fsdevel, linux-kernel, Alexey Dobriyan,
	netdev

On Tue, Aug 05, 2014 at 09:46:55PM +0200, Oleg Nesterov wrote:
> A simple test-case from Kirill Shutemov
> 
> 	cat /proc/self/maps >/dev/null
> 	chmod +x /proc/self/net/packet
> 	exec /proc/self/net/packet
> 
> makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
> the opposite order.

Oleg, I see it again with almost the same test-case:

	cat /proc/self/stack >/dev/null
	chmod +x /proc/self/net/packet
	exec /proc/self/net/packet

Looks like bunch of proc files were converted to use seq_file by Alexey
Dobriyan around the same time you've fixed the issue for /proc/pid/maps.

More generic test-case:

	find /proc/self/ -type f -exec dd if='{}' of=/dev/null bs=1 count=1 ';' 2>/dev/null
	chmod +x /proc/self/net/packet
	exec /proc/self/net/packet

David, any justification for allowing chmod +x for files under
/proc/pid/net?

[    2.042212] ======================================================
[    2.042930] [ INFO: possible circular locking dependency detected ]
[    2.043648] 3.18.0-rc7-00003-g3a18ca061311-dirty #237 Not tainted
[    2.044350] -------------------------------------------------------
[    2.045054] sh/94 is trying to acquire lock:
[    2.045546]  (&p->lock){+.+.+.}, at: [<ffffffff811e12fd>] seq_read+0x3d/0x3e0
[    2.045781] 
[    2.045781] but task is already holding lock:
[    2.045781]  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff811c0e3d>] prepare_bprm_creds+0x2d/0x90
[    2.045781] 
[    2.045781] which lock already depends on the new lock.
[    2.045781] 
[    2.045781] 
[    2.045781] the existing dependency chain (in reverse order) is:
[    2.045781] 
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
[    2.045781]        [<ffffffff810a6e99>] __lock_acquire+0x4d9/0xd40
[    2.045781]        [<ffffffff810a7ff2>] lock_acquire+0xd2/0x2a0
[    2.045781]        [<ffffffff81849da6>] mutex_lock_killable_nested+0x66/0x460
[    2.045781]        [<ffffffff81229de4>] lock_trace+0x24/0x70
[    2.045781]        [<ffffffff81229e8f>] proc_pid_stack+0x5f/0xe0
[    2.045781]        [<ffffffff81227244>] proc_single_show+0x54/0xa0
[    2.045781]        [<ffffffff811e13a0>] seq_read+0xe0/0x3e0
[    2.045781]        [<ffffffff811b9377>] vfs_read+0x97/0x180
[    2.045781]        [<ffffffff811b9f5d>] SyS_read+0x4d/0xc0
[    2.045781]        [<ffffffff8184e492>] system_call_fastpath+0x12/0x17
[    2.045781] 
-> #0 (&p->lock){+.+.+.}:
[    2.045781]        [<ffffffff810a389f>] validate_chain.isra.36+0xfff/0x1400
[    2.045781]        [<ffffffff810a6e99>] __lock_acquire+0x4d9/0xd40
[    2.045781]        [<ffffffff810a7ff2>] lock_acquire+0xd2/0x2a0
[    2.045781]        [<ffffffff81849629>] mutex_lock_nested+0x69/0x3c0
[    2.045781]        [<ffffffff811e12fd>] seq_read+0x3d/0x3e0
[    2.045781]        [<ffffffff81226428>] proc_reg_read+0x48/0x70
[    2.045781]        [<ffffffff811b9377>] vfs_read+0x97/0x180
[    2.045781]        [<ffffffff811bf1a8>] kernel_read+0x48/0x60
[    2.045781]        [<ffffffff811bfb2c>] prepare_binprm+0xdc/0x180
[    2.045781]        [<ffffffff811c139a>] do_execve_common.isra.29+0x4fa/0x960
[    2.045781]        [<ffffffff811c1818>] do_execve+0x18/0x20
[    2.045781]        [<ffffffff811c1b05>] SyS_execve+0x25/0x30
[    2.045781]        [<ffffffff8184ea49>] stub_execve+0x69/0xa0
[    2.045781] 
[    2.045781] other info that might help us debug this:
[    2.045781] 
[    2.045781]  Possible unsafe locking scenario:
[    2.045781] 
[    2.045781]        CPU0                    CPU1
[    2.045781]        ----                    ----
[    2.045781]   lock(&sig->cred_guard_mutex);
[    2.045781]                                lock(&p->lock);
[    2.045781]                                lock(&sig->cred_guard_mutex);
[    2.045781]   lock(&p->lock);
[    2.045781] 
[    2.045781]  *** DEADLOCK ***
[    2.045781] 
[    2.045781] 1 lock held by sh/94:
[    2.045781]  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff811c0e3d>] prepare_bprm_creds+0x2d/0x90
[    2.045781] 
[    2.045781] stack backtrace:
[    2.045781] CPU: 0 PID: 94 Comm: sh Not tainted 3.18.0-rc7-00003-g3a18ca061311-dirty #237
[    2.045781] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[    2.045781]  ffffffff82a48d50 ffff88085427bad8 ffffffff81844a85 0000000000000cac
[    2.045781]  ffffffff82a654a0 ffff88085427bb28 ffffffff810a1b03 0000000000000000
[    2.045781]  ffff88085427bb68 ffff88085427bb28 ffff8808547f1500 ffff8808547f1c40
[    2.045781] Call Trace:
[    2.045781]  [<ffffffff81844a85>] dump_stack+0x4e/0x68
[    2.045781]  [<ffffffff810a1b03>] print_circular_bug+0x203/0x310
[    2.045781]  [<ffffffff810a389f>] validate_chain.isra.36+0xfff/0x1400
[    2.045781]  [<ffffffff8108fa76>] ? local_clock+0x16/0x30
[    2.045781]  [<ffffffff810a6e99>] __lock_acquire+0x4d9/0xd40
[    2.045781]  [<ffffffff810a7ff2>] lock_acquire+0xd2/0x2a0
[    2.045781]  [<ffffffff811e12fd>] ? seq_read+0x3d/0x3e0
[    2.045781]  [<ffffffff81849629>] mutex_lock_nested+0x69/0x3c0
[    2.045781]  [<ffffffff811e12fd>] ? seq_read+0x3d/0x3e0
[    2.045781]  [<ffffffff8108f9f8>] ? sched_clock_cpu+0x98/0xc0
[    2.045781]  [<ffffffff811e12fd>] ? seq_read+0x3d/0x3e0
[    2.045781]  [<ffffffff814050b9>] ? lockref_put_or_lock+0x29/0x40
[    2.045781]  [<ffffffff811e12fd>] seq_read+0x3d/0x3e0
[    2.045781]  [<ffffffff814050b9>] ? lockref_put_or_lock+0x29/0x40
[    2.045781]  [<ffffffff81226428>] proc_reg_read+0x48/0x70
[    2.045781]  [<ffffffff811b9377>] vfs_read+0x97/0x180
[    2.045781]  [<ffffffff811bf1a8>] kernel_read+0x48/0x60
[    2.045781]  [<ffffffff811bfb2c>] prepare_binprm+0xdc/0x180
[    2.045781]  [<ffffffff811c139a>] do_execve_common.isra.29+0x4fa/0x960
[    2.092142] tsc: Refined TSC clocksource calibration: 2693.484 MHz
[    2.045781]  [<ffffffff811c0fd3>] ? do_execve_common.isra.29+0x133/0x960
[    2.045781]  [<ffffffff8184f04d>] ? retint_swapgs+0xe/0x13
[    2.045781]  [<ffffffff811c1818>] do_execve+0x18/0x20
[    2.045781]  [<ffffffff811c1b05>] SyS_execve+0x25/0x30
[    2.045781]  [<ffffffff8184ea49>] stub_execve+0x69/0xa0
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-12-03 14:14   ` Kirill A. Shutemov
@ 2014-12-03 16:59     ` Eric W. Biederman
  2014-12-04 16:17       ` Kirill A. Shutemov
  2014-12-03 17:34     ` Oleg Nesterov
  1 sibling, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2014-12-03 16:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Oleg Nesterov, David S. Miller, Linus Torvalds, Andrew Morton,
	Alexander Viro, Cyrill Gorcunov, David Howells,
	Kirill A. Shutemov, Peter Zijlstra, Sasha Levin, linux-fsdevel,
	linux-kernel, Alexey Dobriyan, netdev

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Aug 05, 2014 at 09:46:55PM +0200, Oleg Nesterov wrote:
>> A simple test-case from Kirill Shutemov
>> 
>> 	cat /proc/self/maps >/dev/null
>> 	chmod +x /proc/self/net/packet
>> 	exec /proc/self/net/packet
>> 
>> makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
>> the opposite order.
>
> Oleg, I see it again with almost the same test-case:
>
> 	cat /proc/self/stack >/dev/null
> 	chmod +x /proc/self/net/packet
> 	exec /proc/self/net/packet
>
> Looks like bunch of proc files were converted to use seq_file by Alexey
> Dobriyan around the same time you've fixed the issue for /proc/pid/maps.
>
> More generic test-case:
>
> 	find /proc/self/ -type f -exec dd if='{}' of=/dev/null bs=1 count=1 ';' 2>/dev/null
> 	chmod +x /proc/self/net/packet
> 	exec /proc/self/net/packet
>
> David, any justification for allowing chmod +x for files under
> /proc/pid/net?

I don't think there are any good reasons for allowing chmod +x for the
proc generic files.   Certainly executing any of them is nonsense.

I do recall some weird conner cases existing.  I think they resulted
in a need to preserve chmod if not chmod +x.  This is just me saying
tread carefully before you change anything.

It really should be safe to tweak proc_notify_change to not allow
messing with the executable bits of proc files.

> [    2.042212] ======================================================
> [    2.042930] [ INFO: possible circular locking dependency detected ]
> [    2.043648] 3.18.0-rc7-00003-g3a18ca061311-dirty #237 Not tainted
> [    2.044350] -------------------------------------------------------
> [    2.045054] sh/94 is trying to acquire lock:
> [    2.045546]  (&p->lock){+.+.+.}, at: [<ffffffff811e12fd>] seq_read+0x3d/0x3e0
> [    2.045781] 
> [    2.045781] but task is already holding lock:
> [    2.045781]  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff811c0e3d>] prepare_bprm_creds+0x2d/0x90
> [    2.045781] 
> [    2.045781] which lock already depends on the new lock.
> [    2.045781] 
> [    2.045781] 
> [    2.045781] the existing dependency chain (in reverse order) is:
> [    2.045781] 
> -> #1 (&sig->cred_guard_mutex){+.+.+.}:
> [    2.045781]        [<ffffffff810a6e99>] __lock_acquire+0x4d9/0xd40
> [    2.045781]        [<ffffffff810a7ff2>] lock_acquire+0xd2/0x2a0
> [    2.045781]        [<ffffffff81849da6>] mutex_lock_killable_nested+0x66/0x460
> [    2.045781]        [<ffffffff81229de4>] lock_trace+0x24/0x70
> [    2.045781]        [<ffffffff81229e8f>] proc_pid_stack+0x5f/0xe0
> [    2.045781]        [<ffffffff81227244>] proc_single_show+0x54/0xa0
> [    2.045781]        [<ffffffff811e13a0>] seq_read+0xe0/0x3e0
> [    2.045781]        [<ffffffff811b9377>] vfs_read+0x97/0x180
> [    2.045781]        [<ffffffff811b9f5d>] SyS_read+0x4d/0xc0
> [    2.045781]        [<ffffffff8184e492>] system_call_fastpath+0x12/0x17
> [    2.045781] 
> -> #0 (&p->lock){+.+.+.}:
> [    2.045781]        [<ffffffff810a389f>] validate_chain.isra.36+0xfff/0x1400
> [    2.045781]        [<ffffffff810a6e99>] __lock_acquire+0x4d9/0xd40
> [    2.045781]        [<ffffffff810a7ff2>] lock_acquire+0xd2/0x2a0
> [    2.045781]        [<ffffffff81849629>] mutex_lock_nested+0x69/0x3c0
> [    2.045781]        [<ffffffff811e12fd>] seq_read+0x3d/0x3e0
> [    2.045781]        [<ffffffff81226428>] proc_reg_read+0x48/0x70
> [    2.045781]        [<ffffffff811b9377>] vfs_read+0x97/0x180
> [    2.045781]        [<ffffffff811bf1a8>] kernel_read+0x48/0x60
> [    2.045781]        [<ffffffff811bfb2c>] prepare_binprm+0xdc/0x180
> [    2.045781]        [<ffffffff811c139a>] do_execve_common.isra.29+0x4fa/0x960
> [    2.045781]        [<ffffffff811c1818>] do_execve+0x18/0x20
> [    2.045781]        [<ffffffff811c1b05>] SyS_execve+0x25/0x30
> [    2.045781]        [<ffffffff8184ea49>] stub_execve+0x69/0xa0
> [    2.045781] 
> [    2.045781] other info that might help us debug this:
> [    2.045781] 
> [    2.045781]  Possible unsafe locking scenario:
> [    2.045781] 
> [    2.045781]        CPU0                    CPU1
> [    2.045781]        ----                    ----
> [    2.045781]   lock(&sig->cred_guard_mutex);
> [    2.045781]                                lock(&p->lock);
> [    2.045781]                                lock(&sig->cred_guard_mutex);
> [    2.045781]   lock(&p->lock);
> [    2.045781] 
> [    2.045781]  *** DEADLOCK ***
> [    2.045781] 
> [    2.045781] 1 lock held by sh/94:
> [    2.045781]  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff811c0e3d>] prepare_bprm_creds+0x2d/0x90
> [    2.045781] 
> [    2.045781] stack backtrace:
> [    2.045781] CPU: 0 PID: 94 Comm: sh Not tainted 3.18.0-rc7-00003-g3a18ca061311-dirty #237
> [    2.045781] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [    2.045781]  ffffffff82a48d50 ffff88085427bad8 ffffffff81844a85 0000000000000cac
> [    2.045781]  ffffffff82a654a0 ffff88085427bb28 ffffffff810a1b03 0000000000000000
> [    2.045781]  ffff88085427bb68 ffff88085427bb28 ffff8808547f1500 ffff8808547f1c40
> [    2.045781] Call Trace:
> [    2.045781]  [<ffffffff81844a85>] dump_stack+0x4e/0x68
> [    2.045781]  [<ffffffff810a1b03>] print_circular_bug+0x203/0x310
> [    2.045781]  [<ffffffff810a389f>] validate_chain.isra.36+0xfff/0x1400
> [    2.045781]  [<ffffffff8108fa76>] ? local_clock+0x16/0x30
> [    2.045781]  [<ffffffff810a6e99>] __lock_acquire+0x4d9/0xd40
> [    2.045781]  [<ffffffff810a7ff2>] lock_acquire+0xd2/0x2a0
> [    2.045781]  [<ffffffff811e12fd>] ? seq_read+0x3d/0x3e0
> [    2.045781]  [<ffffffff81849629>] mutex_lock_nested+0x69/0x3c0
> [    2.045781]  [<ffffffff811e12fd>] ? seq_read+0x3d/0x3e0
> [    2.045781]  [<ffffffff8108f9f8>] ? sched_clock_cpu+0x98/0xc0
> [    2.045781]  [<ffffffff811e12fd>] ? seq_read+0x3d/0x3e0
> [    2.045781]  [<ffffffff814050b9>] ? lockref_put_or_lock+0x29/0x40
> [    2.045781]  [<ffffffff811e12fd>] seq_read+0x3d/0x3e0
> [    2.045781]  [<ffffffff814050b9>] ? lockref_put_or_lock+0x29/0x40
> [    2.045781]  [<ffffffff81226428>] proc_reg_read+0x48/0x70
> [    2.045781]  [<ffffffff811b9377>] vfs_read+0x97/0x180
> [    2.045781]  [<ffffffff811bf1a8>] kernel_read+0x48/0x60
> [    2.045781]  [<ffffffff811bfb2c>] prepare_binprm+0xdc/0x180
> [    2.045781]  [<ffffffff811c139a>] do_execve_common.isra.29+0x4fa/0x960
> [    2.092142] tsc: Refined TSC clocksource calibration: 2693.484 MHz
> [    2.045781]  [<ffffffff811c0fd3>] ? do_execve_common.isra.29+0x133/0x960
> [    2.045781]  [<ffffffff8184f04d>] ? retint_swapgs+0xe/0x13
> [    2.045781]  [<ffffffff811c1818>] do_execve+0x18/0x20
> [    2.045781]  [<ffffffff811c1b05>] SyS_execve+0x25/0x30
> [    2.045781]  [<ffffffff8184ea49>] stub_execve+0x69/0xa0

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-12-03 14:14   ` Kirill A. Shutemov
  2014-12-03 16:59     ` Eric W. Biederman
@ 2014-12-03 17:34     ` Oleg Nesterov
  1 sibling, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2014-12-03 17:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: David S. Miller, Linus Torvalds, Andrew Morton, Alexander Viro,
	Cyrill Gorcunov, David Howells, Eric W. Biederman,
	Kirill A. Shutemov, Peter Zijlstra, Sasha Levin, linux-fsdevel,
	linux-kernel, Alexey Dobriyan, netdev

On 12/03, Kirill A. Shutemov wrote:
>
> On Tue, Aug 05, 2014 at 09:46:55PM +0200, Oleg Nesterov wrote:
> > A simple test-case from Kirill Shutemov
> >
> > 	cat /proc/self/maps >/dev/null
> > 	chmod +x /proc/self/net/packet
> > 	exec /proc/self/net/packet
> >
> > makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
> > the opposite order.
>
> Oleg, I see it again with almost the same test-case:
>
> 	cat /proc/self/stack >/dev/null
> 	chmod +x /proc/self/net/packet
> 	exec /proc/self/net/packet

Yes, there are more lock_trace/mm_access (ab)users. Fortunately, they
are much simpler than proc/pid/maps (which also asked for other cleanups
and fixes).

I'll try to take a look, thanks for reminding.

And I agree with Eric, chmod+x probably makes no sense. Still I think
this code deserves some cleanups regardless. To the point I think that
lock_trace() should probably die.

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
  2014-12-03 16:59     ` Eric W. Biederman
@ 2014-12-04 16:17       ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-12-04 16:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, David S. Miller, Linus Torvalds, Andrew Morton,
	Alexander Viro, Cyrill Gorcunov, David Howells,
	Kirill A. Shutemov, Peter Zijlstra, Sasha Levin, linux-fsdevel,
	linux-kernel, Alexey Dobriyan, netdev

On Wed, Dec 03, 2014 at 10:59:57AM -0600, Eric W. Biederman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > On Tue, Aug 05, 2014 at 09:46:55PM +0200, Oleg Nesterov wrote:
> >> A simple test-case from Kirill Shutemov
> >> 
> >> 	cat /proc/self/maps >/dev/null
> >> 	chmod +x /proc/self/net/packet
> >> 	exec /proc/self/net/packet
> >> 
> >> makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
> >> the opposite order.
> >
> > Oleg, I see it again with almost the same test-case:
> >
> > 	cat /proc/self/stack >/dev/null
> > 	chmod +x /proc/self/net/packet
> > 	exec /proc/self/net/packet
> >
> > Looks like bunch of proc files were converted to use seq_file by Alexey
> > Dobriyan around the same time you've fixed the issue for /proc/pid/maps.
> >
> > More generic test-case:
> >
> > 	find /proc/self/ -type f -exec dd if='{}' of=/dev/null bs=1 count=1 ';' 2>/dev/null
> > 	chmod +x /proc/self/net/packet
> > 	exec /proc/self/net/packet
> >
> > David, any justification for allowing chmod +x for files under
> > /proc/pid/net?
> 
> I don't think there are any good reasons for allowing chmod +x for the
> proc generic files.   Certainly executing any of them is nonsense.
> 
> I do recall some weird conner cases existing.  I think they resulted
> in a need to preserve chmod if not chmod +x.  This is just me saying
> tread carefully before you change anything.
> 
> It really should be safe to tweak proc_notify_change to not allow
> messing with the executable bits of proc files.

BTW, we have MS_NOSUID and MS_NOEXEC set in ->s_flags for procfs since
2006 -- see 92d032855e64.

But there's no code which would translate them into vfsmount->mnt_flags |=
MNT_NOSUID/MNT_NOEXEC and we bypast nosuid/noexec checks on exec path.

Hm?..

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2014-12-04 16:21 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-05 19:46 [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Oleg Nesterov
2014-08-05 19:46 ` [PATCH v2 1/7] fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map() Oleg Nesterov
2014-08-06  9:52   ` Kirill A. Shutemov
2014-08-06 14:10     ` Oleg Nesterov
2014-08-06 19:41       ` Oleg Nesterov
2014-08-05 19:46 ` [PATCH v2 2/7] fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open() Oleg Nesterov
2014-08-06  9:55   ` Kirill A. Shutemov
2014-08-05 19:46 ` [PATCH v2 3/7] proc: introduce proc_mem_open() Oleg Nesterov
2014-08-06  9:57   ` Kirill A. Shutemov
2014-08-05 19:46 ` [PATCH v2 4/7] fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
2014-08-06 10:05   ` Kirill A. Shutemov
2014-12-03 14:14   ` Kirill A. Shutemov
2014-12-03 16:59     ` Eric W. Biederman
2014-12-04 16:17       ` Kirill A. Shutemov
2014-12-03 17:34     ` Oleg Nesterov
2014-08-05 19:46 ` [PATCH v2 5/7] fs/proc/task_mmu.c: simplify the vma_stop() logic Oleg Nesterov
2014-08-06 10:12   ` Kirill A. Shutemov
2014-08-05 19:47 ` [PATCH v2 6/7] fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next() Oleg Nesterov
2014-08-06 10:17   ` Kirill A. Shutemov
2014-08-05 19:47 ` [PATCH v2 7/7] fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop() Oleg Nesterov
2014-08-06 10:18   ` Kirill A. Shutemov
2014-08-06 10:19 ` [PATCH v2 0/7] /proc/PID/*maps* fixes/cleanups Kirill A. Shutemov
2014-08-06 18:48 ` Cyrill Gorcunov
2014-08-07 19:17 ` [PATCH 0/5] fs/proc/task_mmu.c: cleanup tail_vma/last_addr mess Oleg Nesterov
2014-08-07 19:17   ` [PATCH 1/5] fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic Oleg Nesterov
2014-08-07 19:17   ` [PATCH 2/5] fs/proc/task_mmu.c: simplify m_start() to make it readable Oleg Nesterov
2014-08-07 19:17   ` [PATCH 3/5] fs/proc/task_mmu.c: introduce m_next_vma() helper Oleg Nesterov
2014-08-07 19:17   ` [PATCH 4/5] fs/proc/task_mmu.c: reintroduce m->version logic Oleg Nesterov
2014-08-07 19:18   ` [PATCH 5/5] fs/proc/task_mmu.c: update m->version in the main loop in m_start() Oleg Nesterov
2014-08-11 17:00 ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Oleg Nesterov
2014-08-11 17:00   ` [PATCH 1/3] fs/proc/task_nommu.c: change maps_open() to use __seq_open_private() Oleg Nesterov
2014-08-11 17:00   ` [PATCH 2/3] fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open() Oleg Nesterov
2014-08-11 17:00   ` [PATCH 3/3] fs/proc/task_nommu.c: don't use priv->task->mm Oleg Nesterov
2014-08-12 13:57   ` [PATCH 0/3] fs/proc/task_nommu.c: copy-and-paste the changes from task_mmu.c Greg Ungerer
2014-08-12 17:35     ` Oleg Nesterov
2014-08-20 16:49 ` [PATCH -mm 0/2] make /proc/PID/*maps* namespace friendly Oleg Nesterov
2014-08-20 16:49   ` [PATCH -mm 1/2] proc/maps: replace proc_maps_private->pid with "struct inode *inode" Oleg Nesterov
2014-08-20 16:50   ` [PATCH -mm 2/2] proc/maps: make vm_is_stack() logic namespace-friendly Oleg Nesterov
2014-08-20 19:22 ` [PATCH 0/4] mempolicy: get_task_policy() cleanups/preparations Oleg Nesterov
2014-08-20 19:22   ` [PATCH 1/4] mempolicy: change alloc_pages_vma() to use mpol_cond_put() Oleg Nesterov
2014-08-20 19:23   ` [PATCH 2/4] mempolicy: change get_task_policy() to return default_policy rather than NULL Oleg Nesterov
2014-08-20 19:23   ` [PATCH 3/4] mempolicy: sanitize the usage of get_task_policy() Oleg Nesterov
2014-08-20 19:23   ` [PATCH 4/4] mempolicy: remove the "task" arg of vma_policy_mof() and simplify it Oleg Nesterov
2014-08-21 19:00   ` [PATCH 0/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
2014-08-21 19:00     ` [PATCH 1/4] mempolicy: introduce __get_vma_policy(), export get_task_policy() Oleg Nesterov
2014-08-21 19:01     ` [PATCH 2/4] mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race Oleg Nesterov
2014-08-21 19:01     ` [PATCH 3/4] mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem) Oleg Nesterov
2014-08-21 19:01     ` [PATCH 4/4] mempolicy: unexport get_vma_policy() and remove its "task" arg Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.