* [PATCH 0/3] Generic support for revoking mappings
@ 2010-09-25 23:33 ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-25 23:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel, Rik van Riel


During hotplug or rmmod, if a device or file is mmapped, its mapping
needs to be removed and future accesses need to return SIGBUS, almost
like truncate.  This case is rare enough that it barely gets tested,
making a generic implementation warranted.  I have tried with sysfs,
but a complete and generic implementation does not seem possible
without mm support.

It looks like a fully generic implementation with mm knowledge
is shorter and easier to get right than what I have in sysfs today.

So here is that fully generic implementation.
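For concreteness, the expected caller is a driver's hotplug/rmmod path.
A kernel-style pseudocode sketch (not compiled; only revoke_mappings()
comes from this series — my_device, my_device_quiesce and the inode
field are invented for illustration):

```c
/* Hypothetical unplug path.  Only revoke_mappings() is proposed by
 * this series; everything else here is an invented example device.
 */
static void my_driver_unplug(struct my_device *dev)
{
	/* Quiesce the hardware so no new I/O or DMA starts ... */
	my_device_quiesce(dev);

	/* ... then tear down any user mappings of the device file.
	 * Faults on those ranges afterwards raise SIGBUS. */
	revoke_mappings(dev->inode->i_mapping);
}
```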

Eric W. Biederman (3):
      mm: Introduce revoke_mappings.
      mm: Consolidate vma destruction into remove_vma.
      mm: Cause revoke_mappings to wait until all close methods have completed.
---
 include/linux/fs.h |    2 +
 include/linux/mm.h |    2 +
 mm/Makefile        |    2 +-
 mm/mmap.c          |   34 +++++-----
 mm/nommu.c         |    5 ++
 mm/revoke.c        |  192 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 219 insertions(+), 18 deletions(-)

Eric


* [PATCH 1/3] mm: Introduce revoke_mappings.
  2010-09-25 23:33 ` Eric W. Biederman
@ 2010-09-25 23:33   ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-25 23:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel, Rik van Riel


When the backing store of a file becomes inaccessible, we need a
function to remove that file from the page tables and arrange for
subsequent page faults to trigger SIGBUS.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 include/linux/mm.h |    2 +
 mm/Makefile        |    2 +-
 mm/nommu.c         |    5 ++
 mm/revoke.c        |  180 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 188 insertions(+), 1 deletions(-)
 create mode 100644 mm/revoke.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74949fb..444544c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -828,6 +828,8 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
 
 int invalidate_inode_page(struct page *page);
 
+extern void revoke_mappings(struct address_space *mapping);
+
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
diff --git a/mm/Makefile b/mm/Makefile
index 34b2546..e741676 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o
+			   vmalloc.o pagewalk.o revoke.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   maccess.o page_alloc.o page-writeback.o \
diff --git a/mm/nommu.c b/mm/nommu.c
index 88ff091..3e8b5ec 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1785,6 +1785,11 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+void revoke_mappings(struct address_space *mapping)
+{
+}
+EXPORT_SYMBOL_GPL(revoke_mappings);
+
 /*
  * Check that a process has enough memory to allocate a new virtual
  * mapping. 0 means there is enough memory for the allocation to
diff --git a/mm/revoke.c b/mm/revoke.c
new file mode 100644
index 0000000..a76cd1a
--- /dev/null
+++ b/mm/revoke.c
@@ -0,0 +1,180 @@
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+
+#include "internal.h"
+
+/* Make revoked areas file backed */
+static struct file *revoked_filp;
+
+static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct revoked_vm_ops = {
+	.fault	= revoked_fault,
+};
+
+static int revoked_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &revoked_vm_ops;
+	return 0;
+}
+
+static const struct file_operations revoked_fops = {
+	.mmap	= revoked_mmap,
+};
+
+/* Flags preserved from the original revoked vma.
+ */
+#define REVOKED_VM_FLAGS (\
+	VM_READ		| \
+	VM_WRITE	| \
+	VM_EXEC		| \
+	VM_MAYREAD	| \
+	VM_MAYWRITE	| \
+	VM_MAYEXEC	| \
+	VM_DONTCOPY	| \
+	VM_NORESERVE	| \
+	0		)
+
+static void revoke_vma(struct vm_area_struct *old)
+{
+	/* Atomically replace a vma with an identical one that returns
+	 * VM_FAULT_SIGBUS on every page fault.
+	 *
+	 * This function must be called with the mm->mmap semaphore held.
+	 */
+	unsigned long start, end, len, pgoff, vm_flags;
+	struct vm_area_struct *new;
+	struct mm_struct *mm;
+	struct file *file;
+
+	file  = revoked_filp;
+	mm    = old->vm_mm;
+	start = old->vm_start;
+	end   = old->vm_end;
+	len   = end - start;
+	pgoff = old->vm_pgoff;
+
+	/* Preserve the user visible vm_flags. */
+	vm_flags = VM_SHARED | VM_MAYSHARE | (old->vm_flags & REVOKED_VM_FLAGS);
+
+	/* If kmem_cache_zalloc fails return and ultimately try again */
+	new = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+	if (!new)
+		goto out;
+
+	/* I am freeing exactly one vma so munmap should never fail.
+	 * If munmap fails return and ultimately try again.
+	 */
+	if (unlikely(do_munmap(mm, start, len)))
+		goto fail;
+
+	INIT_LIST_HEAD(&new->anon_vma_chain);
+	new->vm_mm    = mm;
+	new->vm_start = start;
+	new->vm_end   = end;
+	new->vm_flags = vm_flags;
+	new->vm_page_prot = vm_get_page_prot(vm_flags);
+	new->vm_pgoff = pgoff;
+	new->vm_file  = file;
+	get_file(file);
+	new->vm_ops   = &revoked_vm_ops;
+
+	/* Since the area was just unmapped there is no excuse for
+	 * insert_vm_struct to fail.
+	 *
+	 * If insert_vm_struct fails we will cause a SIGSEGV instead
+	 * of a SIGBUS.  A shame, but not the end of the world.
+	 */
+	if (unlikely(insert_vm_struct(mm, new)))
+		goto fail;
+
+	mm->total_vm += len >> PAGE_SHIFT;
+
+	perf_event_mmap(new);
+
+	return;
+fail:
+	kmem_cache_free(vm_area_cachep, new);
+	WARN_ONCE(1, "%s failed\n", __func__);
+out:
+	return;
+}
+
+static bool revoke_mapping(struct address_space *mapping, struct mm_struct *mm,
+			   unsigned long addr)
+{
+	/* Returns true if the locks were dropped */
+	struct vm_area_struct *vma;
+
+	/*
+	 * Drop i_mmap_lock and grab the mmap semaphore so I can call
+	 * revoke_vma.
+	 */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return false;
+	spin_unlock(&mapping->i_mmap_lock);
+	down_write(&mm->mmap_sem);
+
+	/* There was a vma at mm, addr that needed to be revoked.
+	 * Look and see if there is still a vma there that needs
+	 * to be revoked.
+	 */
+	vma = find_vma(mm, addr);
+	if (vma && vma->vm_file && vma->vm_file->f_mapping == mapping)
+		revoke_vma(vma);
+
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	spin_lock(&mapping->i_mmap_lock);
+	return true;
+}
+
+void revoke_mappings(struct address_space *mapping)
+{
+	/* Make any access to previously mapped pages trigger a SIGBUS,
+	 * and stop calling vm_ops methods.
+	 *
+	 * When revoke_mappings returns invocations of vm_ops->close
+	 * may still be in progress, but no invocations of any other
+	 * vm_ops methods will be.
+	 */
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+
+	spin_lock(&mapping->i_mmap_lock);
+
+restart_tree:
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
+		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
+			goto restart_tree;
+	}
+
+restart_list:
+	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
+			goto restart_list;
+	}
+
+	spin_unlock(&mapping->i_mmap_lock);
+}
+EXPORT_SYMBOL_GPL(revoke_mappings);
+
+static int __init revoke_init(void)
+{
+	int ret = 0;
+	revoked_filp = anon_inode_getfile("[revoked]", &revoked_fops, NULL,
+					  O_RDWR /* do flags matter here? */);
+	if (IS_ERR(revoked_filp))
+		ret = PTR_ERR(revoked_filp);
+	return ret;
+}
+module_init(revoke_init);
-- 
1.7.2.3



* [PATCH 2/3] mm: Consolidate vma destruction into remove_vma.
  2010-09-25 23:33 ` Eric W. Biederman
@ 2010-09-25 23:34   ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-25 23:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel, Rik van Riel


Consolidate vma destruction into remove_vma.  This is slightly
better for code size and for code maintenance, avoiding the pain
of three copies of everything needed to tear down a vma.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 mm/mmap.c |   21 +++++----------------
 1 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 6128dc8..17dd003 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -643,16 +643,10 @@ again:			remove_next = 1 + (end > next->vm_end);
 		spin_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
-		if (file) {
-			fput(file);
-			if (next->vm_flags & VM_EXECUTABLE)
-				removed_exe_file_vma(mm);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
+		remove_vma(next);
 		mm->map_count--;
-		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -2002,19 +1996,14 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 		return 0;
 
 	/* Clean everything up if vma_adjust failed. */
-	if (new->vm_ops && new->vm_ops->close)
-		new->vm_ops->close(new);
-	if (new->vm_file) {
-		if (vma->vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(mm);
-		fput(new->vm_file);
-	}
+	remove_vma(new);
+ out_err:
+	return err;
  out_free_mpol:
 	mpol_put(pol);
  out_free_vma:
 	kmem_cache_free(vm_area_cachep, new);
- out_err:
-	return err;
+	goto out_err;
 }
 
 /*
-- 
1.7.2.3



* [PATCH 3/3] mm: Cause revoke_mappings to wait until all close methods have completed.
  2010-09-25 23:33 ` Eric W. Biederman
@ 2010-09-25 23:34   ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-25 23:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel, Rik van Riel


Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 include/linux/fs.h |    2 ++
 mm/mmap.c          |   13 ++++++++++++-
 mm/revoke.c        |   18 +++++++++++++++---
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76041b6..5d3d6b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -633,6 +633,8 @@ struct address_space {
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+	struct task_struct	*revoke_task;	/* Who to wake up when all vmas are closed */
+	unsigned int		close_count;	/* Cover race conditions with revoke_mappings */
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
diff --git a/mm/mmap.c b/mm/mmap.c
index 17dd003..3df3193 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -218,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
 		struct address_space *mapping = file->f_mapping;
 		spin_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
+		mapping->close_count++;
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 }
@@ -233,9 +234,19 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file) {
-		fput(vma->vm_file);
+		struct address_space *mapping = vma->vm_file->f_mapping;
 		if (vma->vm_flags & VM_EXECUTABLE)
 			removed_exe_file_vma(vma->vm_mm);
+
+		/* Decrement the close count and wake up a revoker if present */
+		spin_lock(&mapping->i_mmap_lock);
+		mapping->close_count--;
+		if ((mapping->close_count == 0) && mapping->revoke_task)
+			/* Is wake_up_process the right variant of try_to_wake_up? */
+			wake_up_process(mapping->revoke_task);
+		spin_unlock(&mapping->i_mmap_lock);
+
+		fput(vma->vm_file);
 	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
diff --git a/mm/revoke.c b/mm/revoke.c
index a76cd1a..e19f7df 100644
--- a/mm/revoke.c
+++ b/mm/revoke.c
@@ -143,15 +143,17 @@ void revoke_mappings(struct address_space *mapping)
 	/* Make any access to previously mapped pages trigger a SIGBUS,
 	 * and stop calling vm_ops methods.
 	 *
-	 * When revoke_mappings returns invocations of vm_ops->close
-	 * may still be in progress, but no invocations of any other
-	 * vm_ops methods will be.
+	 * When revoke_mappings returns, no invocations of any
+	 * vm_ops method will be in progress.
 	 */
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 
 	spin_lock(&mapping->i_mmap_lock);
 
+	WARN_ON(mapping->revoke_task);
+	mapping->revoke_task = current;
+
 restart_tree:
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
 		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
@@ -164,6 +166,16 @@ restart_list:
 			goto restart_list;
 	}
 
+	while (!list_empty(&mapping->i_mmap_nonlinear) ||
+	       !prio_tree_empty(&mapping->i_mmap) ||
+	       mapping->close_count)
+	{
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock(&mapping->i_mmap_lock);
+		schedule();
+		spin_lock(&mapping->i_mmap_lock);
+	}
+	mapping->revoke_task = NULL;
 	spin_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL_GPL(revoke_mappings);
-- 
1.7.2.3



* [PATCH 3/3] mm: Cause revoke_mappings to wait until all close methods have completed.
@ 2010-09-25 23:34   ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-25 23:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-mm, linux-kernel, Rik van Riel


Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 include/linux/fs.h |    2 ++
 mm/mmap.c          |   13 ++++++++++++-
 mm/revoke.c        |   18 +++++++++++++++---
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 76041b6..5d3d6b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -633,6 +633,8 @@ struct address_space {
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+	struct task_struct	*revoke_task;	/* Who to wake up when all vmas are closed */
+	unsigned int		close_count;	/* Cover race conditions with revoke_mappings */
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
diff --git a/mm/mmap.c b/mm/mmap.c
index 17dd003..3df3193 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -218,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
 		struct address_space *mapping = file->f_mapping;
 		spin_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
+		mapping->close_count++;
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 }
@@ -233,9 +234,19 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file) {
-		fput(vma->vm_file);
+		struct address_space *mapping = vma->vm_file->f_mapping;
 		if (vma->vm_flags & VM_EXECUTABLE)
 			removed_exe_file_vma(vma->vm_mm);
+
+		/* Decrement the close count and wake up a revoker if present */
+		spin_lock(&mapping->i_mmap_lock);
+		mapping->close_count--;
+		if ((mapping->close_count == 0) && mapping->revoke_task)
+			/* Is wake_up_process the right variant of try_to_wake_up? */
+			wake_up_process(mapping->revoke_task);
+		spin_unlock(&mapping->i_mmap_lock);
+
+		fput(vma->vm_file);
 	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
diff --git a/mm/revoke.c b/mm/revoke.c
index a76cd1a..e19f7df 100644
--- a/mm/revoke.c
+++ b/mm/revoke.c
@@ -143,15 +143,17 @@ void revoke_mappings(struct address_space *mapping)
 	/* Make any access to previously mapped pages trigger a SIGBUS,
 	 * and stop calling vm_ops methods.
 	 *
-	 * When revoke_mappings returns invocations of vm_ops->close
-	 * may still be in progress, but no invocations of any other
-	 * vm_ops methods will be.
+	 * When revoke_mappings no invocations of any method will be
+	 * in progress.
 	 */
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 
 	spin_lock(&mapping->i_mmap_lock);
 
+	WARN_ON(mapping->revoke_task);
+	mapping->revoke_task = current;
+
 restart_tree:
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
 		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
@@ -164,6 +166,16 @@ restart_list:
 			goto restart_list;
 	}
 
+	while (!list_empty(&mapping->i_mmap_nonlinear) ||
+	       !prio_tree_empty(&mapping->i_mmap) ||
+	       mapping->close_count)
+	{
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock(&mapping->i_mmap_lock);
+		schedule();
+		spin_lock(&mapping->i_mmap_lock);
+	}
+	mapping->revoke_task = NULL;
 	spin_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL_GPL(revoke_mappings);
-- 
1.7.2.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: Consolidate vma destruction into remove_vma.
  2010-09-25 23:34   ` Eric W. Biederman
@ 2010-09-27  6:37     ` Pekka Enberg
  0 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-09-27  6:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Hugh Dickins, linux-mm, linux-kernel, Rik van Riel

Hi Eric,

On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Consolidate vma destruction in remove_vma.   This is slightly
> better for code size and for code maintenance.  Avoiding the pain
> of 3 copies of everything needed to tear down a vma.
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
>  mm/mmap.c |   21 +++++----------------
>  1 files changed, 5 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 6128dc8..17dd003 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -643,16 +643,10 @@ again:                    remove_next = 1 + (end > next->vm_end);
>                spin_unlock(&mapping->i_mmap_lock);
>
>        if (remove_next) {
> -               if (file) {
> -                       fput(file);
> -                       if (next->vm_flags & VM_EXECUTABLE)
> -                               removed_exe_file_vma(mm);
> -               }
>                if (next->anon_vma)
>                        anon_vma_merge(vma, next);
> +               remove_vma(next);

remove_vma() does vma->vm_ops->close() but we don't do that here. Are
you sure the conversion is safe?

>                mm->map_count--;
> -               mpol_put(vma_policy(next));
> -               kmem_cache_free(vm_area_cachep, next);
>                /*
>                 * In mprotect's case 6 (see comments on vma_merge),
>                 * we must remove another next too. It would clutter


* Re: [PATCH 2/3] mm: Consolidate vma destruction into remove_vma.
  2010-09-27  6:37     ` Pekka Enberg
@ 2010-09-27  6:39       ` Pekka Enberg
  -1 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-09-27  6:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

(Fixing Hugh's email address.)

On Mon, Sep 27, 2010 at 9:37 AM, Pekka Enberg <penberg@kernel.org> wrote:
> Hi Eric,
>
> On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Consolidate vma destruction in remove_vma.   This is slightly
>> better for code size and for code maintenance.  Avoiding the pain
>> of 3 copies of everything needed to tear down a vma.
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> ---
>>  mm/mmap.c |   21 +++++----------------
>>  1 files changed, 5 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 6128dc8..17dd003 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -643,16 +643,10 @@ again:                    remove_next = 1 + (end > next->vm_end);
>>                spin_unlock(&mapping->i_mmap_lock);
>>
>>        if (remove_next) {
>> -               if (file) {
>> -                       fput(file);
>> -                       if (next->vm_flags & VM_EXECUTABLE)
>> -                               removed_exe_file_vma(mm);
>> -               }
>>                if (next->anon_vma)
>>                        anon_vma_merge(vma, next);
>> +               remove_vma(next);
>
> remove_vma() does vma->vm_ops->close() but we don't do that here. Are
> you sure the conversion is safe?
>
>>                mm->map_count--;
>> -               mpol_put(vma_policy(next));
>> -               kmem_cache_free(vm_area_cachep, next);
>>                /*
>>                 * In mprotect's case 6 (see comments on vma_merge),
>>                 * we must remove another next too. It would clutter
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: Consolidate vma destruction into remove_vma.
  2010-09-27  6:39       ` Pekka Enberg
@ 2010-09-27  6:44         ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-27  6:44 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

Pekka Enberg <penberg@kernel.org> writes:

> (Fixing Hugh's email address.)

Sorry about that; somehow a typo crept in.

> On Mon, Sep 27, 2010 at 9:37 AM, Pekka Enberg <penberg@kernel.org> wrote:
>> Hi Eric,
>>
>> On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Consolidate vma destruction in remove_vma.   This is slightly
>>> better for code size and for code maintenance.  Avoiding the pain
>>> of 3 copies of everything needed to tear down a vma.
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>>> ---
>>>  mm/mmap.c |   21 +++++----------------
>>>  1 files changed, 5 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 6128dc8..17dd003 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -643,16 +643,10 @@ again:                    remove_next = 1 + (end > next->vm_end);
>>>                spin_unlock(&mapping->i_mmap_lock);
>>>
>>>        if (remove_next) {
>>> -               if (file) {
>>> -                       fput(file);
>>> -                       if (next->vm_flags & VM_EXECUTABLE)
>>> -                               removed_exe_file_vma(mm);
>>> -               }
>>>                if (next->anon_vma)
>>>                        anon_vma_merge(vma, next);
>>> +               remove_vma(next);
>>
>> remove_vma() does vma->vm_ops->close() but we don't do that here. Are
>> you sure the conversion is safe?

Definitely.  It actually isn't possible to reach that point with a
vma that has a close method.

Until I had traced through all of the code paths, I suspected that
calling remove_vma there might even be a bug fix.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: Consolidate vma destruction into remove_vma.
  2010-09-27  6:44         ` Eric W. Biederman
@ 2010-09-27  7:11           ` Pekka Enberg
  -1 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-09-27  7:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

On Mon, Sep 27, 2010 at 9:37 AM, Pekka Enberg <penberg@kernel.org> wrote:
>>> On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Consolidate vma destruction in remove_vma.   This is slightly
>>>> better for code size and for code maintenance.  Avoiding the pain
>>>> of 3 copies of everything needed to tear down a vma.
>>>>
>>>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>>>> ---
>>>>  mm/mmap.c |   21 +++++----------------
>>>>  1 files changed, 5 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 6128dc8..17dd003 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -643,16 +643,10 @@ again:                    remove_next = 1 + (end > next->vm_end);
>>>>                spin_unlock(&mapping->i_mmap_lock);
>>>>
>>>>        if (remove_next) {
>>>> -               if (file) {
>>>> -                       fput(file);
>>>> -                       if (next->vm_flags & VM_EXECUTABLE)
>>>> -                               removed_exe_file_vma(mm);
>>>> -               }
>>>>                if (next->anon_vma)
>>>>                        anon_vma_merge(vma, next);
>>>> +               remove_vma(next);
>>>
>>> remove_vma() does vma->vm_ops->close() but we don't do that here. Are
>>> you sure the conversion is safe?
>
> Definitely.  It actually isn't possible to reach that point with a
> vma that has a close method.
>
> Until I had traced through all of the code paths I suspect calling
> remove_vma there might have been a bug fix.

Can we add that to the changelog, please? Otherwise

Acked-by: Pekka Enberg <penberg@kernel.org>

                        Pekka

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] mm: Cause revoke_mappings to wait until all close methods have completed.
  2010-09-25 23:34   ` Eric W. Biederman
@ 2010-09-27  7:15     ` Pekka Enberg
  -1 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-09-27  7:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>

The changelog is slightly too terse for me to review this. What's the
race you're trying to avoid? If the point is to make the revoke task
wait until everything has been closed, why can't we use the completion
API here?

> ---
>  include/linux/fs.h |    2 ++
>  mm/mmap.c          |   13 ++++++++++++-
>  mm/revoke.c        |   18 +++++++++++++++---
>  3 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 76041b6..5d3d6b8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -633,6 +633,8 @@ struct address_space {
>        const struct address_space_operations *a_ops;   /* methods */
>        unsigned long           flags;          /* error bits/gfp mask */
>        struct backing_dev_info *backing_dev_info; /* device readahead, etc */
> +       struct task_struct      *revoke_task;   /* Who to wake up when all vmas are closed */
> +       unsigned int            close_count;    /* Cover race conditions with revoke_mappings */
>        spinlock_t              private_lock;   /* for use by the address_space */
>        struct list_head        private_list;   /* ditto */
>        struct address_space    *assoc_mapping; /* ditto */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 17dd003..3df3193 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -218,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
>                struct address_space *mapping = file->f_mapping;
>                spin_lock(&mapping->i_mmap_lock);
>                __remove_shared_vm_struct(vma, file, mapping);
> +               mapping->close_count++;
>                spin_unlock(&mapping->i_mmap_lock);
>        }
>  }
> @@ -233,9 +234,19 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>        if (vma->vm_ops && vma->vm_ops->close)
>                vma->vm_ops->close(vma);
>        if (vma->vm_file) {
> -               fput(vma->vm_file);
> +               struct address_space *mapping = vma->vm_file->f_mapping;
>                if (vma->vm_flags & VM_EXECUTABLE)
>                        removed_exe_file_vma(vma->vm_mm);
> +
> +               /* Decrement the close count and wake up a revoker if present */
> +               spin_lock(&mapping->i_mmap_lock);
> +               mapping->close_count--;
> +               if ((mapping->close_count == 0) && mapping->revoke_task)
> +                       /* Is wake_up_process the right variant of try_to_wake_up? */
> +                       wake_up_process(mapping->revoke_task);
> +               spin_unlock(&mapping->i_mmap_lock);
> +
> +               fput(vma->vm_file);
>        }
>        mpol_put(vma_policy(vma));
>        kmem_cache_free(vm_area_cachep, vma);
> diff --git a/mm/revoke.c b/mm/revoke.c
> index a76cd1a..e19f7df 100644
> --- a/mm/revoke.c
> +++ b/mm/revoke.c
> @@ -143,15 +143,17 @@ void revoke_mappings(struct address_space *mapping)
>        /* Make any access to previously mapped pages trigger a SIGBUS,
>         * and stop calling vm_ops methods.
>         *
> -        * When revoke_mappings returns invocations of vm_ops->close
> -        * may still be in progress, but no invocations of any other
> -        * vm_ops methods will be.
> +        * When revoke_mappings returns, no invocations of any method
> +        * will be in progress.
>         */
>        struct vm_area_struct *vma;
>        struct prio_tree_iter iter;
>
>        spin_lock(&mapping->i_mmap_lock);
>
> +       WARN_ON(mapping->revoke_task);
> +       mapping->revoke_task = current;
> +
>  restart_tree:
>        vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
>                if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
> @@ -164,6 +166,16 @@ restart_list:
>                        goto restart_list;
>        }
>
> +       while (!list_empty(&mapping->i_mmap_nonlinear) ||
> +              !prio_tree_empty(&mapping->i_mmap) ||
> +              mapping->close_count)
> +       {
> +               __set_current_state(TASK_UNINTERRUPTIBLE);
> +               spin_unlock(&mapping->i_mmap_lock);
> +               schedule();
> +               spin_lock(&mapping->i_mmap_lock);
> +       }
> +       mapping->revoke_task = NULL;
>        spin_unlock(&mapping->i_mmap_lock);
>  }
>  EXPORT_SYMBOL_GPL(revoke_mappings);
> --
> 1.7.2.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <dont@kvack.org>
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] mm: Cause revoke_mappings to wait until all close methods have completed.
  2010-09-27  7:15     ` Pekka Enberg
@ 2010-09-27  8:04       ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-27  8:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

Pekka Enberg <penberg@kernel.org> writes:

> On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>
> The changelog is slightly too terse for me to review this. What's the
> race you're trying to avoid? If the point is to make the revoke task
> wait until everything has been closed, why can't we use the completion
> API here?

Because, as far as I can tell, completions don't map onto this at all.
At best what we have are moments when the set of vmas goes empty.

The close_count is needed because we take vmas off the lists early to
avoid issues with truncate, so we need something that tells us when a
close has actually finished, or whether a close is still in progress.

I used a simple task pointer to wake up instead of a wait queue because,
in all of the revoke scenarios I know of, it makes sense to serialize at
a higher level; a task pointer is also smaller than a wait queue head,
and I am reluctant to make struct inode any larger than necessary.

The count at least always has to be present, because objects could start
closing before we start the revoke.  We can't be ham-handed and grab
every mm->mmap_sem, because in mmput() revoke_vma is called without the
mmap_sem held.

So it looked to me like the cleanest and smallest way to go was to write
the old-fashioned schedule/wait loop that is usually hidden behind
completions or wait queue logic.

It is my hope that we can be clever at some point and create a union,
either in struct inode proper or in struct address_space, of a few
other fields that cannot be used while revoking a vma (like potentially
the truncate count) and reuse them.  But I am not clever enough today
to do something like that.

Eric

>> ---
>>  include/linux/fs.h |    2 ++
>>  mm/mmap.c          |   13 ++++++++++++-
>>  mm/revoke.c        |   18 +++++++++++++++---
>>  3 files changed, 29 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 76041b6..5d3d6b8 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -633,6 +633,8 @@ struct address_space {
>>        const struct address_space_operations *a_ops;   /* methods */
>>        unsigned long           flags;          /* error bits/gfp mask */
>>        struct backing_dev_info *backing_dev_info; /* device readahead, etc */
>> +       struct task_struct      *revoke_task;   /* Who to wake up when all vmas are closed */
>> +       unsigned int            close_count;    /* Cover race conditions with revoke_mappings */
>>        spinlock_t              private_lock;   /* for use by the address_space */
>>        struct list_head        private_list;   /* ditto */
>>        struct address_space    *assoc_mapping; /* ditto */
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 17dd003..3df3193 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -218,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
>>                struct address_space *mapping = file->f_mapping;
>>                spin_lock(&mapping->i_mmap_lock);
>>                __remove_shared_vm_struct(vma, file, mapping);
>> +               mapping->close_count++;
>>                spin_unlock(&mapping->i_mmap_lock);
>>        }
>>  }
>> @@ -233,9 +234,19 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>>        if (vma->vm_ops && vma->vm_ops->close)
>>                vma->vm_ops->close(vma);
>>        if (vma->vm_file) {
>> -               fput(vma->vm_file);
>> +               struct address_space *mapping = vma->vm_file->f_mapping;
>>                if (vma->vm_flags & VM_EXECUTABLE)
>>                        removed_exe_file_vma(vma->vm_mm);
>> +
>> +               /* Decrement the close count and wake up a revoker if present */
>> +               spin_lock(&mapping->i_mmap_lock);
>> +               mapping->close_count--;
>> +               if ((mapping->close_count == 0) && mapping->revoke_task)
>> +                       /* Is wake_up_process the right variant of try_to_wake_up? */
>> +                       wake_up_process(mapping->revoke_task);
>> +               spin_unlock(&mapping->i_mmap_lock);
>> +
>> +               fput(vma->vm_file);
>>        }
>>        mpol_put(vma_policy(vma));
>>        kmem_cache_free(vm_area_cachep, vma);
>> diff --git a/mm/revoke.c b/mm/revoke.c
>> index a76cd1a..e19f7df 100644
>> --- a/mm/revoke.c
>> +++ b/mm/revoke.c
>> @@ -143,15 +143,17 @@ void revoke_mappings(struct address_space *mapping)
>>        /* Make any access to previously mapped pages trigger a SIGBUS,
>>         * and stop calling vm_ops methods.
>>         *
>> -        * When revoke_mappings returns invocations of vm_ops->close
>> -        * may still be in progress, but no invocations of any other
>> -        * vm_ops methods will be.
>> +        * When revoke_mappings returns, no invocations of any method
>> +        * will be in progress.
>>         */
>>        struct vm_area_struct *vma;
>>        struct prio_tree_iter iter;
>>
>>        spin_lock(&mapping->i_mmap_lock);
>>
>> +       WARN_ON(mapping->revoke_task);
>> +       mapping->revoke_task = current;
>> +
>>  restart_tree:
>>        vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
>>                if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
>> @@ -164,6 +166,16 @@ restart_list:
>>                        goto restart_list;
>>        }
>>
>> +       while (!list_empty(&mapping->i_mmap_nonlinear) ||
>> +              !prio_tree_empty(&mapping->i_mmap) ||
>> +              mapping->close_count)
>> +       {
>> +               __set_current_state(TASK_UNINTERRUPTIBLE);
>> +               spin_unlock(&mapping->i_mmap_lock);
>> +               schedule();
>> +               spin_lock(&mapping->i_mmap_lock);
>> +       }
>> +       mapping->revoke_task = NULL;
>>        spin_unlock(&mapping->i_mmap_lock);
>>  }
>>  EXPORT_SYMBOL_GPL(revoke_mappings);
>> --
>> 1.7.2.3
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] mm: Cause revoke_mappings to wait until all close methods have completed.
@ 2010-09-27  8:04       ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-27  8:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

Pekka Enberg <penberg@kernel.org> writes:

> On Sun, Sep 26, 2010 at 2:34 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>
> The changelog is slightly too terse for me to review this. What's the
> race you're trying to avoid? If the point is to make the revoke task
> wait until everything has been closed, why can't we use the completion
> API here?

Because as far as I can tell completions don't map at all.
At best we have are times when the set of vmas goes empty.

The close_count is needed as we take vmas off the lists early
to avoid issues with truncate, and so we something to tell
when the close is actually finished.  If a close is actually
in progress.

I used a simple task to wake up instead of a wait queue because, in all
of the revoke scenarios I know of, it makes sense to serialize at a
higher level, a task pointer is smaller than a wait queue head, and
I am reluctant to grow struct inode beyond what is necessary.

The count at least has to be always present because objects could start
closing before we start the revoke.  We can't be ham-handed and grab
every mm->mmap_sem, because on the mmput() path revoke_vma is called
without the mmap_sem held.

So it looked to me like the cleanest and smallest way to go was to write
an old-fashioned schedule/wait loop of the kind that is usually hidden
behind completions or wait queue logic.

It is my hope that we can be clever at some point and create a union,
either in struct inode proper or in struct address_space, with a few
other fields that cannot be used while revoking a vma (potentially
the truncate count) and reuse them.  But I am not clever enough today
to do something like that.

Eric

>> ---
>>  include/linux/fs.h |    2 ++
>>  mm/mmap.c          |   13 ++++++++++++-
>>  mm/revoke.c        |   18 +++++++++++++++---
>>  3 files changed, 29 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 76041b6..5d3d6b8 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -633,6 +633,8 @@ struct address_space {
>>        const struct address_space_operations *a_ops;   /* methods */
>>        unsigned long           flags;          /* error bits/gfp mask */
>>        struct backing_dev_info *backing_dev_info; /* device readahead, etc */
>> +       struct task_struct      *revoke_task;   /* Who to wake up when all vmas are closed */
>> +       unsigned int            close_count;    /* Cover race conditions with revoke_mappings */
>>        spinlock_t              private_lock;   /* for use by the address_space */
>>        struct list_head        private_list;   /* ditto */
>>        struct address_space    *assoc_mapping; /* ditto */
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 17dd003..3df3193 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -218,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
>>                struct address_space *mapping = file->f_mapping;
>>                spin_lock(&mapping->i_mmap_lock);
>>                __remove_shared_vm_struct(vma, file, mapping);
>> +               mapping->close_count++;
>>                spin_unlock(&mapping->i_mmap_lock);
>>        }
>>  }
>> @@ -233,9 +234,19 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>>        if (vma->vm_ops && vma->vm_ops->close)
>>                vma->vm_ops->close(vma);
>>        if (vma->vm_file) {
>> -               fput(vma->vm_file);
>> +               struct address_space *mapping = vma->vm_file->f_mapping;
>>                if (vma->vm_flags & VM_EXECUTABLE)
>>                        removed_exe_file_vma(vma->vm_mm);
>> +
>> +               /* Decrement the close count and wake up a revoker if present */
>> +               spin_lock(&mapping->i_mmap_lock);
>> +               mapping->close_count--;
>> +               if ((mapping->close_count == 0) && mapping->revoke_task)
>> +                       /* Is wake_up_process the right variant of try_to_wake_up? */
>> +                       wake_up_process(mapping->revoke_task);
>> +               spin_unlock(&mapping->i_mmap_lock);
>> +
>> +               fput(vma->vm_file);
>>        }
>>        mpol_put(vma_policy(vma));
>>        kmem_cache_free(vm_area_cachep, vma);
>> diff --git a/mm/revoke.c b/mm/revoke.c
>> index a76cd1a..e19f7df 100644
>> --- a/mm/revoke.c
>> +++ b/mm/revoke.c
>> @@ -143,15 +143,17 @@ void revoke_mappings(struct address_space *mapping)
>>        /* Make any access to previously mapped pages trigger a SIGBUS,
>>         * and stop calling vm_ops methods.
>>         *
>> -        * When revoke_mappings returns invocations of vm_ops->close
>> -        * may still be in progress, but no invocations of any other
>> -        * vm_ops methods will be.
>> +        * When revoke_mappings returns, no invocations of any
>> +        * method will be in progress.
>>         */
>>        struct vm_area_struct *vma;
>>        struct prio_tree_iter iter;
>>
>>        spin_lock(&mapping->i_mmap_lock);
>>
>> +       WARN_ON(mapping->revoke_task);
>> +       mapping->revoke_task = current;
>> +
>>  restart_tree:
>>        vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
>>                if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
>> @@ -164,6 +166,16 @@ restart_list:
>>                        goto restart_list;
>>        }
>>
>> +       while (!list_empty(&mapping->i_mmap_nonlinear) ||
>> +              !prio_tree_empty(&mapping->i_mmap) ||
>> +              mapping->close_count)
>> +       {
>> +               __set_current_state(TASK_UNINTERRUPTIBLE);
>> +               spin_unlock(&mapping->i_mmap_lock);
>> +               schedule();
>> +               spin_lock(&mapping->i_mmap_lock);
>> +       }
>> +       mapping->revoke_task = NULL;
>>        spin_unlock(&mapping->i_mmap_lock);
>>  }
>>  EXPORT_SYMBOL_GPL(revoke_mappings);
>> --
>> 1.7.2.3
>>
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] Generic support for revoking mappings
  2010-09-25 23:33 ` Eric W. Biederman
@ 2010-09-27  8:52   ` CAI Qian
  -1 siblings, 0 replies; 28+ messages in thread
From: CAI Qian @ 2010-09-27  8:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Hugh Dickins, linux-mm, linux-kernel, Rik van Riel, Andrew Morton

Just a heads up. I tried to boot the latest mmotm kernel with these patches applied and hit this. I am wondering what I did wrong.

Pid: 1, comm: init Not tainted 2.6.36-rc5-mm1+ #2 /KVM
RIP: 0010:[<ffffffff811d4c78>]  [<ffffffff811d4c78>] prio_tree_insert+0x188/0x2a0
RSP: 0018:ffff880c3b1bfcd8  EFLAGS: 00010202
RAX: ffff880c374b40d8 RBX: 0000000000000100 RCX: ffff880c374b40d8
RDX: 0000000000000179 RSI: 0000000000000000 RDI: 0000000000000179
RBP: ffff880c9f4ba188 R08: 0000000000000001 R09: ffff880c374b9330
R10: 0000000000000001 R11: 0000000000000002 R12: ffff880c374b40d8
R13: 00000007fa7367ba R14: 00000007fa7367be R15: 0000000000000000
FS:  00007fa7369d9700(0000) GS:ffff8800df540000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffc0 CR3: 0000000c374b1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 1, threadinfo ffff880c3b1be000, task ffff880c3b1bd400)
Stack:
 ffff880c3b1bd400 ffff880c374b4088 ffff880c374b40d8 ffff880c374b4088
<0> ffff880c9f4ba168 ffff880c9f4ba188 ffff880c374b3680 ffffffff810daff8
<0> 0000000000000002 ffff880c374b41f8 ffff880c374b42b0 ffffffff810e9171
Call Trace:
 [<ffffffff810daff8>] ? vma_prio_tree_insert+0x28/0x120
 [<ffffffff810e9171>] ? vma_adjust+0xe1/0x560
 [<ffffffff8119715b>] ? avc_has_perm+0x6b/0xa0
 [<ffffffff810e97b9>] ? __split_vma+0x1c9/0x250
 [<ffffffff810ebf88>] ? mprotect_fixup+0x708/0x7b0
 [<ffffffff810e4aca>] ? handle_mm_fault+0x1da/0xcf0
 [<ffffffff81033910>] ? pvclock_clocksource_read+0x50/0xc0
 [<ffffffff81047220>] ? __dequeue_entity+0x40/0x50
 [<ffffffff81198a31>] ? file_has_perm+0xf1/0x100
 [<ffffffff810ec1b2>] ? sys_mprotect+0x182/0x250
 [<ffffffff8100aec2>] ? system_call_fastpath+0x16/0x1b
Code: 56 20 e9 d4 fe ff ff bb 01 00 00 00 48 d3 e3 48 85 db 0f 84 08 01 00 00 45 31 ff 66 45 85 c0 4c 89 e1 74 78 0f 1f 80 00 00 00 00 <48> 8b 46 c0 48 2b 46 b8 4c 8b 6e 40 48 c1 e8 0c 4c 39 ef 4d 8d 
RIP  [<ffffffff811d4c78>] prio_tree_insert+0x188/0x2a0
 RSP <ffff880c3b1bfcd8>
CR2: ffffffffffffffc0
---[ end trace 667258bb79b38e02 ]---

----- "Eric W. Biederman" <ebiederm@xmission.com> wrote:

> During hotplug or rmmod if a device or file is mmaped, it's mapping
> needs to be removed and future access need to return SIGBUS almost
> like
> truncate.  This case is rare enough that it barely gets tested, and
> making a generic implementation warranted.  I have tried with sysfs
> but
> a complete and generic implementation does not seem possible without
> mm
> support.
> 
> It looks like a fully generic implementation with mm knowledge
> is shorter and easier to get right than what I have in sysfs today.
> 
> So here is that fully generic implementation.
> 
> Eric W. Biederman (3):
>       mm: Introduce revoke_mappings.
>       mm: Consolidate vma destruction into remove_vma.
>       mm: Cause revoke_mappings to wait until all close methods have
> completed.
> ---
>  include/linux/fs.h |    2 +
>  include/linux/mm.h |    2 +
>  mm/Makefile        |    2 +-
>  mm/mmap.c          |   34 +++++-----
>  mm/nommu.c         |    5 ++
>  mm/revoke.c        |  192
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 219 insertions(+), 18 deletions(-)
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/3] Generic support for revoking mappings
  2010-09-27  8:52   ` CAI Qian
@ 2010-09-27  9:12     ` Américo Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Américo Wang @ 2010-09-27  9:12 UTC (permalink / raw)
  To: CAI Qian
  Cc: Eric W. Biederman, Hugh Dickins, linux-mm, linux-kernel,
	Rik van Riel, Andrew Morton

On Mon, Sep 27, 2010 at 04:52:29AM -0400, CAI Qian wrote:
>Just a head up. Tried to boot latest mmotm kernel with those patches applied hit this. I am wondering what I did wrong.
>

You missed the header of this oops/warning/bug; is it a BUG_ON, a WARN_ON, or something else?


>Pid: 1, comm: init Not tainted 2.6.36-rc5-mm1+ #2 /KVM
>RIP: 0010:[<ffffffff811d4c78>]  [<ffffffff811d4c78>] prio_tree_insert+0x188/0x2a0
>RSP: 0018:ffff880c3b1bfcd8  EFLAGS: 00010202
>RAX: ffff880c374b40d8 RBX: 0000000000000100 RCX: ffff880c374b40d8
>RDX: 0000000000000179 RSI: 0000000000000000 RDI: 0000000000000179
>RBP: ffff880c9f4ba188 R08: 0000000000000001 R09: ffff880c374b9330
>R10: 0000000000000001 R11: 0000000000000002 R12: ffff880c374b40d8
>R13: 00000007fa7367ba R14: 00000007fa7367be R15: 0000000000000000
>FS:  00007fa7369d9700(0000) GS:ffff8800df540000(0000) knlGS:0000000000000000
>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>CR2: ffffffffffffffc0 CR3: 0000000c374b1000 CR4: 00000000000006e0
>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>Process init (pid: 1, threadinfo ffff880c3b1be000, task ffff880c3b1bd400)
>Stack:
> ffff880c3b1bd400 ffff880c374b4088 ffff880c374b40d8 ffff880c374b4088
><0> ffff880c9f4ba168 ffff880c9f4ba188 ffff880c374b3680 ffffffff810daff8
><0> 0000000000000002 ffff880c374b41f8 ffff880c374b42b0 ffffffff810e9171
>Call Trace:
> [<ffffffff810daff8>] ? vma_prio_tree_insert+0x28/0x120
> [<ffffffff810e9171>] ? vma_adjust+0xe1/0x560
> [<ffffffff8119715b>] ? avc_has_perm+0x6b/0xa0
> [<ffffffff810e97b9>] ? __split_vma+0x1c9/0x250
> [<ffffffff810ebf88>] ? mprotect_fixup+0x708/0x7b0
> [<ffffffff810e4aca>] ? handle_mm_fault+0x1da/0xcf0
> [<ffffffff81033910>] ? pvclock_clocksource_read+0x50/0xc0
> [<ffffffff81047220>] ? __dequeue_entity+0x40/0x50
> [<ffffffff81198a31>] ? file_has_perm+0xf1/0x100
> [<ffffffff810ec1b2>] ? sys_mprotect+0x182/0x250
> [<ffffffff8100aec2>] ? system_call_fastpath+0x16/0x1b
>Code: 56 20 e9 d4 fe ff ff bb 01 00 00 00 48 d3 e3 48 85 db 0f 84 08 01 00 00 45 31 ff 66 45 85 c0 4c 89 e1 74 78 0f 1f 80 00 00 00 00 <48> 8b 46 c0 48 2b 46 b8 4c 8b 6e 40 48 c1 e8 0c 4c 39 ef 4d 8d 
>RIP  [<ffffffff811d4c78>] prio_tree_insert+0x188/0x2a0
> RSP <ffff880c3b1bfcd8>
>CR2: ffffffffffffffc0
>---[ end trace 667258bb79b38e02 ]---
>

Looks like something wrong in page fault.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: Introduce revoke_mappings.
  2010-09-25 23:33   ` Eric W. Biederman
@ 2010-09-27 10:49     ` Pekka Enberg
  -1 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-09-27 10:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

> Subject: [PATCH 1/3] mm: Introduce revoke_mappings.
>
> When the backing store of a file becomes inaccessible we need a function
> to remove that file from the page tables and arrange for page faults
> to trigger SIGBUS.
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>

> +static void revoke_vma(struct vm_area_struct *old)
> +{
> +	/* Atomically replace a vma with an identical one that returns
> +	 * VM_FAULT_SIGBUS to every mmap request.
> +	 *
> +	 * This function must be called with the mm->mmap semaphore held.
> +	 */
> +	unsigned long start, end, len, pgoff, vm_flags;
> +	struct vm_area_struct *new;
> +	struct mm_struct *mm;
> +	struct file *file;
> +
> +	file  = revoked_filp;
> +	mm    = old->vm_mm;
> +	start = old->vm_start;
> +	end   = old->vm_end;
> +	len   = end - start;
> +	pgoff = old->vm_pgoff;
> +
> +	/* Preserve user visble vm_flags. */
> +	vm_flags = VM_SHARED | VM_MAYSHARE | (old->vm_flags & REVOKED_VM_FLAGS);
> +
> +	/* If kmem_cache_zalloc fails return and ultimately try again */
> +	new = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> +	if (!new)
> +		goto out;
> +
> +	/* I am freeing exactly one vma so munmap should never fail.
> +	 * If munmap fails return and ultimately try again.
> +	 */
> +	if (unlikely(do_munmap(mm, start, len)))
> +		goto fail;
> +
> +	INIT_LIST_HEAD(&new->anon_vma_chain);
> +	new->vm_mm    = mm;
> +	new->vm_start = start;
> +	new->vm_end   = end;
> +	new->vm_flags = vm_flags;
> +	new->vm_page_prot = vm_get_page_prot(vm_flags);
> +	new->vm_pgoff = pgoff;
> +	new->vm_file  = file;
> +	get_file(file);
> +	new->vm_ops   = &revoked_vm_ops;
> +
> +	/* Since the area was just umapped there is no excuse for
> +	 * insert_vm_struct to fail.
> +	 *
> +	 * If insert_vm_struct fails we will cause a SIGSEGV instead
> +	 * a SIGBUS.  A shame but not the end of the world.

Can we simply fix up the old vma to avoid kmem_cache_zalloc() and
insert_vm_struct altogether? We're protected by ->mmap_sem so that shouldn't be
a problem?

> +	 */
> +	if (unlikely(insert_vm_struct(mm, new)))
> +		goto fail;
> +
> +	mm->total_vm += len >> PAGE_SHIFT;
> +
> +	perf_event_mmap(new);
> +
> +	return;
> +fail:
> +	kmem_cache_free(vm_area_cachep, new);
> +	WARN_ONCE(1, "%s failed\n", __func__);

Why don't we just propagate errors such as -ENOMEM to the callers? It seems
pointless to try to retry the operation at this level.

> +out:
> +	return;
> +}
> +
> +static bool revoke_mapping(struct address_space *mapping, struct mm_struct *mm,
> +			   unsigned long addr)
> +{
> +	/* Returns true if the locks were dropped */
> +	struct vm_area_struct *vma;
> +
> +	/*
> +	 * Drop i_mmap_lock and grab the mm sempahore so I can call

s/sempahore/semaphore/

> +	 * revoke_vma.
> +	 */
> +	if (!atomic_inc_not_zero(&mm->mm_users))
> +		return false;
> +	spin_unlock(&mapping->i_mmap_lock);
> +	down_write(&mm->mmap_sem);
> +
> +	/* There was a vma at mm, addr that needed to be revoked.
> +	 * Look and see if there is still a vma there that needs
> +	 * to be revoked.
> +	 */
> +	vma = find_vma(mm, addr);

Why aren't we checking for NULL vma here? AFAICT, there's a tiny window between
dropping ->i_mmap_lock and grabbing ->mmap_sem where the vma might have been
unmapped.

> +	if (vma->vm_file->f_mapping == mapping)
> +		revoke_vma(vma);
> +
> +	up_write(&mm->mmap_sem);
> +	mmput(mm);
> +	spin_lock(&mapping->i_mmap_lock);
> +	return true;
> +}
> +
> +void revoke_mappings(struct address_space *mapping)
> +{
> +	/* Make any access to previously mapped pages trigger a SIGBUS,
> +	 * and stop calling vm_ops methods.
> +	 *
> +	 * When revoke_mappings returns invocations of vm_ops->close
> +	 * may still be in progress, but no invocations of any other
> +	 * vm_ops methods will be.
> +	 */
> +	struct vm_area_struct *vma;
> +	struct prio_tree_iter iter;
> +
> +	spin_lock(&mapping->i_mmap_lock);
> +
> +restart_tree:
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
> +		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
> +			goto restart_tree;
> +	}
> +
> +restart_list:
> +	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> +		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
> +			goto restart_list;
> +	}
> +

What prevents a process from remapping the file after we've done revoking the
vma prio tree? Shouldn't we always restart from the top?

Also, don't we need spin_needbreak() on ->i_mmap_lock and cond_resched()
somewhere here like we do in mm/memory.c, for example?

> +	spin_unlock(&mapping->i_mmap_lock);
> +}
> +EXPORT_SYMBOL_GPL(revoke_mappings);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: Introduce revoke_mappings.
  2010-09-27 10:49     ` Pekka Enberg
@ 2010-09-27 14:23       ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2010-09-27 14:23 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins

Pekka Enberg <penberg@kernel.org> writes:

>> Subject: [PATCH 1/3] mm: Introduce revoke_mappings.
>>
>> When the backing store of a file becomes inaccessible we need a function
>> to remove that file from the page tables and arrange for page faults
>> to trigger SIGBUS.
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>
>> +static void revoke_vma(struct vm_area_struct *old)
>> +{
>> +	/* Atomically replace a vma with an identical one that returns
>> +	 * VM_FAULT_SIGBUS to every mmap request.
>> +	 *
>> +	 * This function must be called with the mm->mmap semaphore held.
>> +	 */
>> +	unsigned long start, end, len, pgoff, vm_flags;
>> +	struct vm_area_struct *new;
>> +	struct mm_struct *mm;
>> +	struct file *file;
>> +
>> +	file  = revoked_filp;
>> +	mm    = old->vm_mm;
>> +	start = old->vm_start;
>> +	end   = old->vm_end;
>> +	len   = end - start;
>> +	pgoff = old->vm_pgoff;
>> +
>> +	/* Preserve user visble vm_flags. */
>> +	vm_flags = VM_SHARED | VM_MAYSHARE | (old->vm_flags & REVOKED_VM_FLAGS);
>> +
>> +	/* If kmem_cache_zalloc fails return and ultimately try again */
>> +	new = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
>> +	if (!new)
>> +		goto out;
>> +
>> +	/* I am freeing exactly one vma so munmap should never fail.
>> +	 * If munmap fails return and ultimately try again.
>> +	 */
>> +	if (unlikely(do_munmap(mm, start, len)))
>> +		goto fail;
>> +
>> +	INIT_LIST_HEAD(&new->anon_vma_chain);
>> +	new->vm_mm    = mm;
>> +	new->vm_start = start;
>> +	new->vm_end   = end;
>> +	new->vm_flags = vm_flags;
>> +	new->vm_page_prot = vm_get_page_prot(vm_flags);
>> +	new->vm_pgoff = pgoff;
>> +	new->vm_file  = file;
>> +	get_file(file);
>> +	new->vm_ops   = &revoked_vm_ops;
>> +
>> +	/* Since the area was just unmapped there is no excuse for
>> +	 * insert_vm_struct to fail.
>> +	 *
>> +	 * If insert_vm_struct fails we will cause a SIGSEGV instead
>> +	 * of a SIGBUS.  A shame but not the end of the world.
>
> Can we simply fix up the old vma to avoid kmem_cache_zalloc() and
> insert_vm_struct altogether? We're protected by ->mmap_sem so that shouldn't be
> a problem?

So far this looks far more obvious, and easier to maintain, than simply
reusing the vma.  Certainly I would want to factor out a
remove_vm_struct or something similar if we go down that direction.

Simply reusing the vma requires being wise in the subtle and
twisted ways of the Linux mm subsystem.  I tried that, and my patch
was sufficiently non-obvious that it was non-trivial to review.

Using do_munmap and insert_vm_struct is a bit tedious but it is
straightforward.  Given how excessively complicated the mm is today,
with its excess of flags, data structures, and magic list-order
removals to prevent truncate races, it seems best to just avoid that
mess as much as possible.

>> +	 */
>> +	if (unlikely(insert_vm_struct(mm, new)))
>> +		goto fail;
>> +
>> +	mm->total_vm += len >> PAGE_SHIFT;
>> +
>> +	perf_event_mmap(new);
>> +
>> +	return;
>> +fail:
>> +	kmem_cache_free(vm_area_cachep, new);
>> +	WARN_ONCE(1, "%s failed\n", __func__);
>
> Why don't we just propagate errors such as -ENOMEM to the callers? It seems
> pointless to try to retry the operation at this level.

The two errors that reach here are of the kind that should never
happen.  That is, do_munmap and insert_vm_struct cannot fail today
with the arguments I am giving them.  But since the routines have the
potential to return an error I am handling that, and screaming.

I am not currently propagating errors because what can be done with an
error?  The hardware that I am working with has just been removed, and
I must remove its mappings.  About the only legitimate thing I could do
with an out-of-memory error here is to trigger the OOM killer.  Which
is to say, while sys_revoke can use this mechanism, my primary target
is hotplug removal of drivers.

>> +out:
>> +	return;
>> +}
>> +
>> +static bool revoke_mapping(struct address_space *mapping, struct mm_struct *mm,
>> +			   unsigned long addr)
>> +{
>> +	/* Returns true if the locks were dropped */
>> +	struct vm_area_struct *vma;
>> +
>> +	/*
>> +	 * Drop i_mmap_lock and grab the mm sempahore so I can call
>
> s/sempahore/semaphore/

Thanks.
>
>> +	 * revoke_vma.
>> +	 */
>> +	if (!atomic_inc_not_zero(&mm->mm_users))
>> +		return false;
>> +	spin_unlock(&mapping->i_mmap_lock);
>> +	down_write(&mm->mmap_sem);
>> +
>> +	/* There was a vma at mm, addr that needed to be revoked.
>> +	 * Look and see if there is still a vma there that needs
>> +	 * to be revoked.
>> +	 */
>> +	vma = find_vma(mm, addr);
>
> Why aren't we checking for NULL vma here? AFAICT, there's a tiny window between
> dropping ->i_mmap_lock and grabbing ->mmap_sem where the vma might have been
> unmapped.

Good catch.  I'm certain I checked for that in at least one version.
Brain fart.
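Concretely, the guard Pekka is asking for could look like this (a sketch
against the patch as posted, untested; note that find_vma() can also
return a vma that starts above addr):

```c
	/* The vma may have gone away, or been replaced by an unrelated
	 * one, in the window where both locks were dropped.  find_vma()
	 * returns the first vma with vm_end > addr, which may not cover
	 * addr at all, so recheck everything before revoking.
	 */
	vma = find_vma(mm, addr);
	if (vma && vma->vm_start <= addr && vma->vm_file &&
	    vma->vm_file->f_mapping == mapping)
		revoke_vma(vma);
```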

>> +	if (vma->vm_file->f_mapping == mapping)
>> +		revoke_vma(vma);
>> +
>> +	up_write(&mm->mmap_sem);
>> +	mmput(mm);
>> +	spin_lock(&mapping->i_mmap_lock);
>> +	return true;
>> +}
>> +
>> +void revoke_mappings(struct address_space *mapping)
>> +{
>> +	/* Make any access to previously mapped pages trigger a SIGBUS,
>> +	 * and stop calling vm_ops methods.
>> +	 *
>> +	 * When revoke_mappings returns invocations of vm_ops->close
>> +	 * may still be in progress, but no invocations of any other
>> +	 * vm_ops methods will be.
>> +	 */
>> +	struct vm_area_struct *vma;
>> +	struct prio_tree_iter iter;
>> +
>> +	spin_lock(&mapping->i_mmap_lock);
>> +
>> +restart_tree:
>> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
>> +		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
>> +			goto restart_tree;
>> +	}
>> +
>> +restart_list:
>> +	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
>> +		if (revoke_mapping(mapping, vma->vm_mm, vma->vm_start))
>> +			goto restart_list;
>> +	}
>> +
>
> What prevents a process from remapping the file after we've done revoking the
> vma prio tree? Shouldn't we always restart from the top?

Synchronization at the caller.  I am assuming that mmap is already
refusing to create new mappings when this code is called.  Throwing
away the mappings simply does not make sense if someone can be creating
more at the same time.
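In other words, the expected call-site shape is something like the
following driver-side sketch (all names here are hypothetical, not taken
from the patch):

```c
/* Hypothetical hot-unplug path: stop new mmaps first, then revoke. */
static void mydev_unplug(struct mydev *dev)
{
	mutex_lock(&dev->lock);
	dev->removed = true;	/* mydev_mmap() checks this, returns -ENODEV */
	mutex_unlock(&dev->lock);

	/* No new vmas can appear now, so revoking cannot race with mmap. */
	revoke_mappings(dev->inode->i_mapping);
}
```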

Look at where unmap_mapping_range is called in sysfs or in my most
recent uio patches.  That is where I expect to be calling
revoke_mappings when the dust clears.

I'm certain I had a comment to that effect once...

> Also, don't we need spin_needbreak() on ->i_mmap_lock and cond_resched()
> somewhere here like we do in mm/memory.c, for example?

Why?  Whenever we actually do something we drop the lock and call
blocking operations, so the lock hold times should be quite short.

>> +	spin_unlock(&mapping->i_mmap_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(revoke_mappings);

^ permalink raw reply	[flat|nested] 28+ messages in thread


end of thread, other threads:[~2010-09-27 14:23 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-25 23:33 [PATCH 0/3] Generic support for revoking mappings Eric W. Biederman
2010-09-25 23:33 ` [PATCH 1/3] mm: Introduce revoke_mappings Eric W. Biederman
2010-09-27 10:49   ` Pekka Enberg
2010-09-27 14:23     ` Eric W. Biederman
2010-09-25 23:34 ` [PATCH 2/3] mm: Consolidate vma destruction into remove_vma Eric W. Biederman
2010-09-27  6:37   ` Pekka Enberg
2010-09-27  6:39     ` Pekka Enberg
2010-09-27  6:44       ` Eric W. Biederman
2010-09-27  7:11         ` Pekka Enberg
2010-09-25 23:34 ` [PATCH 3/3] mm: Cause revoke_mappings to wait until all close methods have completed Eric W. Biederman
2010-09-27  7:15   ` Pekka Enberg
2010-09-27  8:04     ` Eric W. Biederman
2010-09-27  8:52 ` [PATCH 0/3] Generic support for revoking mappings CAI Qian
2010-09-27  9:12   ` Américo Wang
