* [PATCH 0/5] mm: i_mmap_mutex to rwsem
@ 2014-05-23  3:33 Davidlohr Bueso
  2014-05-23  3:33 ` [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-23  3:33 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, davidlohr, aswin, linux-mm,
	linux-fsdevel, linux-kernel

This patchset extends the work started by Ingo Molnar in late 2012,
optimizing the anon-vma mutex lock, converting it from an exclusive mutex
to an rwsem, and sharing the lock for read-only paths when walking the
vma-interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
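
For context, the anon-vma side of that work boils down to helpers along
these lines (roughly paraphrased from include/linux/rmap.h after those
commits, shown only to illustrate the pattern this series follows):

	static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
	{
		down_write(&anon_vma->root->rwsem);
	}

	static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
	{
		down_read(&anon_vma->root->rwsem);
	}

	/* ...plus the matching anon_vma_unlock_write()/_unlock_read(). */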

The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
file-backed pages. Therefore we can use similar locking techniques: convert
the mutex to an rwsem and share the lock when possible.
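
Condensed, the locking interface we end up with after this series (the
actual diffs are in patches 1, 3 and 4 below) is:

	static inline void i_mmap_lock_write(struct address_space *mapping)
	{
		down_write(&mapping->i_mmap_rwsem);
	}

	static inline void i_mmap_unlock_write(struct address_space *mapping)
	{
		up_write(&mapping->i_mmap_rwsem);
	}

	static inline void i_mmap_lock_read(struct address_space *mapping)
	{
		down_read(&mapping->i_mmap_rwsem);
	}

	static inline void i_mmap_unlock_read(struct address_space *mapping)
	{
		up_read(&mapping->i_mmap_rwsem);
	}

Paths that modify the vma-interval tree take the rwsem for writing; walkers
that only read the tree can take it for reading and run concurrently.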

With the new optimistic spinning property we have in rwsems, we no longer
take a performance hit when using this lock, and we can therefore
safely do the conversion. Tests show no throughput regressions in aim7 or
pgbench runs, and we do see gains from sharing the lock: disk workloads
improve by ~15% for over 1000 users on an 8-socket Westmere system.

This patchset applies on linux-next-20140522.

Thanks!

Davidlohr Bueso (5):
  mm,fs: introduce helpers around i_mmap_mutex
  mm: use new helper functions around the i_mmap_mutex
  mm: convert i_mmap_mutex to rwsem
  mm/rmap: share the i_mmap_rwsem
  mm: rename leftover i_mmap_mutex

 fs/hugetlbfs/inode.c         | 14 +++++++-------
 fs/inode.c                   |  2 +-
 include/linux/fs.h           | 23 ++++++++++++++++++++++-
 include/linux/mmu_notifier.h |  2 +-
 kernel/events/uprobes.c      |  6 +++---
 kernel/fork.c                |  4 ++--
 mm/filemap.c                 | 10 +++++-----
 mm/filemap_xip.c             |  4 ++--
 mm/hugetlb.c                 | 22 +++++++++++-----------
 mm/memory-failure.c          |  4 ++--
 mm/memory.c                  |  8 ++++----
 mm/mmap.c                    | 22 +++++++++++-----------
 mm/mremap.c                  |  6 +++---
 mm/nommu.c                   | 14 +++++++-------
 mm/rmap.c                    | 10 +++++-----
 15 files changed, 86 insertions(+), 65 deletions(-)

-- 
1.8.1.4



* [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex
  2014-05-23  3:33 [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
@ 2014-05-23  3:33 ` Davidlohr Bueso
  2014-05-23 17:16   ` Rik van Riel
  2014-05-23  3:33 ` [PATCH 2/5] mm: use new helper functions around the i_mmap_mutex Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-23  3:33 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, davidlohr, aswin, linux-mm,
	linux-fsdevel, linux-kernel

Various parts of the kernel acquire and release this mutex,
so add i_mmap_lock_write() and i_mmap_unlock_write() helper
functions that will encapsulate this logic. The next patch
will make use of these.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 include/linux/fs.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1cab2f8..524d2c1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -456,6 +456,16 @@ struct block_device {
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
+static inline void i_mmap_lock_write(struct address_space *mapping)
+{
+	mutex_lock(&mapping->i_mmap_mutex);
+}
+
+static inline void i_mmap_unlock_write(struct address_space *mapping)
+{
+	mutex_unlock(&mapping->i_mmap_mutex);
+}
+
 /*
  * Might pages of this file be mapped into userspace?
  */
-- 
1.8.1.4



* [PATCH 2/5] mm: use new helper functions around the i_mmap_mutex
  2014-05-23  3:33 [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
  2014-05-23  3:33 ` [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex Davidlohr Bueso
@ 2014-05-23  3:33 ` Davidlohr Bueso
  2014-05-23 17:16   ` Rik van Riel
  2014-05-23  3:33 ` [PATCH 3/5] mm: convert i_mmap_mutex to rwsem Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-23  3:33 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, davidlohr, aswin, linux-mm,
	linux-fsdevel, linux-kernel

Convert all open-coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 fs/hugetlbfs/inode.c    |  4 ++--
 kernel/events/uprobes.c |  4 ++--
 kernel/fork.c           |  4 ++--
 mm/filemap_xip.c        |  4 ++--
 mm/hugetlb.c            | 12 ++++++------
 mm/memory-failure.c     |  4 ++--
 mm/memory.c             |  8 ++++----
 mm/mmap.c               | 14 +++++++-------
 mm/mremap.c             |  4 ++--
 mm/nommu.c              | 14 +++++++-------
 mm/rmap.c               |  4 ++--
 11 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 93fc531..bcaf4df 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7716c40..ccf793c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -722,7 +722,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 	int more = 0;
 
  again:
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		if (!valid_vma(vma, is_register))
 			continue;
@@ -753,7 +753,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 		info->mm = vma->vm_mm;
 		info->vaddr = offset_to_vaddr(vma, offset);
 	}
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 
 	if (!more)
 		goto out;
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1..76b1483 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -419,7 +419,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			mutex_lock(&mapping->i_mmap_mutex);
+			i_mmap_lock_write(mapping);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping->i_mmap_writable++;
 			flush_dcache_mmap_lock(mapping);
@@ -431,7 +431,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 				vma_interval_tree_insert_after(tmp, mpnt,
 							&mapping->i_mmap);
 			flush_dcache_mmap_unlock(mapping);
-			mutex_unlock(&mapping->i_mmap_mutex);
+			i_mmap_unlock_write(mapping);
 		}
 
 		/*
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..bad746b 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapping,
 		return;
 
 retry:
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -202,7 +202,7 @@ retry:
 			page_cache_release(page);
 		}
 	}
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 
 	if (locked) {
 		mutex_unlock(&xip_sparse_mutex);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9519cde..161f98d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2761,7 +2761,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * this mapping should be shared between all the VMAs,
 	 * __unmap_hugepage_range() is called as the lock is already held
 	 */
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
 		/* Do not unmap the current VMA */
 		if (iter_vma == vma)
@@ -2778,7 +2778,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 			unmap_hugepage_range(iter_vma, address,
 					     address + huge_page_size(h), page);
 	}
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 
 	return 1;
 }
@@ -3342,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, address, end);
 
 	mmu_notifier_invalidate_range_start(mm, start, end);
-	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
 		ptep = huge_pte_offset(mm, address);
@@ -3370,7 +3370,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * and that page table be reused and filled with junk.
 	 */
 	flush_tlb_range(vma, start, end);
-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 
 	return pages << h->order;
@@ -3538,7 +3538,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -3566,7 +3566,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 	return pte;
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e3154d9..1389a28 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -434,7 +434,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	struct task_struct *tsk;
 	struct address_space *mapping = page->mapping;
 
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page_pgoff(page);
@@ -456,7 +456,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 		}
 	}
 	read_unlock(&tasklist_lock);
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 }
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index e9a3c0f..815bcd9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1340,9 +1340,9 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * safe to do nothing in this case.
 			 */
 			if (vma->vm_file) {
-				mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+				i_mmap_lock_write(vma->vm_file->f_mapping);
 				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
-				mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+				i_mmap_unlock_write(vma->vm_file->f_mapping);
 			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
@@ -2387,12 +2387,12 @@ void unmap_mapping_range(struct address_space *mapping,
 		details.last_index = ULONG_MAX;
 
 
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index da3c212..41a0083 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -236,9 +236,9 @@ void unlink_file_vma(struct vm_area_struct *vma)
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		mutex_lock(&mapping->i_mmap_mutex);
+		i_mmap_lock_write(mapping);
 		__remove_shared_vm_struct(vma, file, mapping);
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 	}
 }
 
@@ -645,14 +645,14 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
-		mutex_lock(&mapping->i_mmap_mutex);
+		i_mmap_lock_write(mapping);
 	}
 
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
 
 	if (mapping)
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -764,7 +764,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 							next->vm_end);
 		}
 
-		mutex_lock(&mapping->i_mmap_mutex);
+		i_mmap_lock_write(mapping);
 		if (insert) {
 			/*
 			 * Put into interval tree now, so instantiated pages
@@ -851,7 +851,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		anon_vma_unlock_write(anon_vma);
 	}
 	if (mapping)
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 
 	if (root) {
 		uprobe_mmap(vma);
@@ -3174,7 +3174,7 @@ static void vm_unlock_mapping(struct address_space *mapping)
 		 * AS_MM_ALL_LOCKS can't change to 0 from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 		if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
 					&mapping->flags))
 			BUG();
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..60259a2 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -119,7 +119,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	if (need_rmap_locks) {
 		if (vma->vm_file) {
 			mapping = vma->vm_file->f_mapping;
-			mutex_lock(&mapping->i_mmap_mutex);
+			i_mmap_lock_write(mapping);
 		}
 		if (vma->anon_vma) {
 			anon_vma = vma->anon_vma;
@@ -156,7 +156,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	if (anon_vma)
 		anon_vma_unlock_write(anon_vma);
 	if (mapping)
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
diff --git a/mm/nommu.c b/mm/nommu.c
index e6ced9d..52047c4 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -722,11 +722,11 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 
-		mutex_lock(&mapping->i_mmap_mutex);
+		i_mmap_lock_write(mapping);
 		flush_dcache_mmap_lock(mapping);
 		vma_interval_tree_insert(vma, &mapping->i_mmap);
 		flush_dcache_mmap_unlock(mapping);
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 	}
 
 	/* add the VMA to the tree */
@@ -795,11 +795,11 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 
-		mutex_lock(&mapping->i_mmap_mutex);
+		i_mmap_lock_write(mapping);
 		flush_dcache_mmap_lock(mapping);
 		vma_interval_tree_remove(vma, &mapping->i_mmap);
 		flush_dcache_mmap_unlock(mapping);
-		mutex_unlock(&mapping->i_mmap_mutex);
+		i_mmap_unlock_write(mapping);
 	}
 
 	/* remove from the MM's tree and list */
@@ -2091,14 +2091,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 	high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
 	down_write(&nommu_region_sem);
-	mutex_lock(&inode->i_mapping->i_mmap_mutex);
+	i_mmap_lock_write(inode->i_mapping);
 
 	/* search for VMAs that fall within the dead zone */
 	vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
 		/* found one - only interested if it's shared out of the page
 		 * cache */
 		if (vma->vm_flags & VM_SHARED) {
-			mutex_unlock(&inode->i_mapping->i_mmap_mutex);
+			i_mmap_unlock_write(inode->i_mapping);
 			up_write(&nommu_region_sem);
 			return -ETXTBSY; /* not quite true, but near enough */
 		}
@@ -2126,7 +2126,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 		}
 	}
 
-	mutex_unlock(&inode->i_mapping->i_mmap_mutex);
+	i_mmap_unlock_write(inode->i_mapping);
 	up_write(&nommu_region_sem);
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 3333baa..9a56e4f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1685,7 +1685,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 
 	if (!mapping)
 		return ret;
-	mutex_lock(&mapping->i_mmap_mutex);
+	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 
@@ -1708,7 +1708,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 	ret = rwc->file_nonlinear(page, mapping, rwc->arg);
 
 done:
-	mutex_unlock(&mapping->i_mmap_mutex);
+	i_mmap_unlock_write(mapping);
 	return ret;
 }
 
-- 
1.8.1.4



* [PATCH 3/5] mm: convert i_mmap_mutex to rwsem
  2014-05-23  3:33 [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
  2014-05-23  3:33 ` [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex Davidlohr Bueso
  2014-05-23  3:33 ` [PATCH 2/5] mm: use new helper functions around the i_mmap_mutex Davidlohr Bueso
@ 2014-05-23  3:33 ` Davidlohr Bueso
  2014-05-23 17:33   ` Rik van Riel
  2014-05-23  3:33 ` [PATCH 4/5] mm/rmap: share the i_mmap_rwsem Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-23  3:33 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, davidlohr, aswin, linux-mm,
	linux-fsdevel, linux-kernel

The i_mmap_mutex is a close cousin of the anon-vma lock,
both protecting similar data, one for file-backed pages
and the other for anon memory. To this end, this lock can
also be an rwsem.

This conversion is straightforward. For now, all users take
the write lock.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 fs/hugetlbfs/inode.c | 10 +++++-----
 fs/inode.c           |  2 +-
 include/linux/fs.h   |  7 ++++---
 mm/mmap.c            |  8 ++++----
 4 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index bcaf4df..020ace5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -472,12 +472,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 }
 
 /*
- * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
+ * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never
  * be taken from reclaim -- unlike regular filesystems. This needs an
  * annotation because huge_pmd_share() does an allocation under
- * i_mmap_mutex.
+ * i_mmap_rwsem.
  */
-static struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
 
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
@@ -495,8 +495,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
-		lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
-				&hugetlbfs_i_mmap_mutex_key);
+		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
+				&hugetlbfs_i_mmap_rwsem_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/inode.c b/fs/inode.c
index 2feb9b6..d26e9f8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -348,7 +348,7 @@ void address_space_init_once(struct address_space *mapping)
 	memset(mapping, 0, sizeof(*mapping));
 	INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
 	spin_lock_init(&mapping->tree_lock);
-	mutex_init(&mapping->i_mmap_mutex);
+	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
 	mapping->i_mmap = RB_ROOT;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 524d2c1..60a1d7d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -18,6 +18,7 @@
 #include <linux/pid.h>
 #include <linux/bug.h>
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
@@ -390,7 +391,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
+	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
@@ -458,12 +459,12 @@ int mapping_tagged(struct address_space *mapping, int tag);
 
 static inline void i_mmap_lock_write(struct address_space *mapping)
 {
-	mutex_lock(&mapping->i_mmap_mutex);
+	down_write(&mapping->i_mmap_rwsem);
 }
 
 static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
-	mutex_unlock(&mapping->i_mmap_mutex);
+	up_write(&mapping->i_mmap_rwsem);
 }
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index 41a0083..bc7a3b2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -208,7 +208,7 @@ error:
 }
 
 /*
- * Requires inode->i_mapping->i_mmap_mutex
+ * Requires inode->i_mapping->i_mmap_rwsem
  */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		struct file *file, struct address_space *mapping)
@@ -2814,7 +2814,7 @@ void exit_mmap(struct mm_struct *mm)
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
- * then i_mmap_mutex is taken here.
+ * then i_mmap_rwsem is taken here.
  */
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
@@ -3078,7 +3078,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		mutex_lock_nest_lock(&mapping->i_mmap_mutex, &mm->mmap_sem);
+		down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_sem);
 	}
 }
 
@@ -3105,7 +3105,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
  * vma in this mm is backed by the same anon_vma or address_space.
  *
  * We can take all the locks in random order because the VM code
- * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never
+ * taking i_mmap_rwsem or anon_vma->rwsem outside the mmap_sem never
  * takes more than one of them in a row. Secondly we're protected
  * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex.
  *
-- 
1.8.1.4



* [PATCH 4/5] mm/rmap: share the i_mmap_rwsem
  2014-05-23  3:33 [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2014-05-23  3:33 ` [PATCH 3/5] mm: convert i_mmap_mutex to rwsem Davidlohr Bueso
@ 2014-05-23  3:33 ` Davidlohr Bueso
  2014-05-23 18:35   ` Rik van Riel
  2014-05-26 19:35   ` Hugh Dickins
  2014-05-23  3:33 ` [PATCH 5/5] mm: rename leftover i_mmap_mutex Davidlohr Bueso
  2014-05-30  2:20 ` [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
  5 siblings, 2 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-23  3:33 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, davidlohr, aswin, linux-mm,
	linux-fsdevel, linux-kernel

As with rmap_walk_anon() and collect_procs_anon(),
there is an opportunity to share the lock in rmap_walk_file()
and collect_procs_file() for file-backed pages.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 include/linux/fs.h  | 10 ++++++++++
 mm/memory-failure.c |  4 ++--
 mm/rmap.c           |  4 ++--
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 60a1d7d..4c2c228 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -467,6 +467,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping)
 	up_write(&mapping->i_mmap_rwsem);
 }
 
+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+	down_read(&mapping->i_mmap_rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+	up_read(&mapping->i_mmap_rwsem);
+}
+
 /*
  * Might pages of this file be mapped into userspace?
  */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1389a28..acbcd8e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -434,7 +434,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	struct task_struct *tsk;
 	struct address_space *mapping = page->mapping;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
 		pgoff_t pgoff = page_pgoff(page);
@@ -456,7 +456,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 		}
 	}
 	read_unlock(&tasklist_lock);
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_read(mapping);
 }
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 9a56e4f..5841dcb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1685,7 +1685,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 
 	if (!mapping)
 		return ret;
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 
@@ -1708,7 +1708,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 	ret = rwc->file_nonlinear(page, mapping, rwc->arg);
 
 done:
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_read(mapping);
 	return ret;
 }
 
-- 
1.8.1.4



* [PATCH 5/5] mm: rename leftover i_mmap_mutex
  2014-05-23  3:33 [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2014-05-23  3:33 ` [PATCH 4/5] mm/rmap: share the i_mmap_rwsem Davidlohr Bueso
@ 2014-05-23  3:33 ` Davidlohr Bueso
  2014-05-23 18:36   ` Rik van Riel
  2014-05-30  2:20 ` [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
  5 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-23  3:33 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, davidlohr, aswin, linux-mm,
	linux-fsdevel, linux-kernel

Update the remaining i_mmap_mutex references to i_mmap_rwsem throughout
the kernel. All changes are in comments and documentation.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
---
 include/linux/mmu_notifier.h |  2 +-
 kernel/events/uprobes.c      |  2 +-
 mm/filemap.c                 | 10 +++++-----
 mm/hugetlb.c                 | 10 +++++-----
 mm/mremap.c                  |  2 +-
 mm/rmap.c                    |  6 +++---
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..f9c11ab 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -151,7 +151,7 @@ struct mmu_notifier_ops {
  * Therefore notifier chains can only be traversed when either
  *
  * 1. mmap_sem is held.
- * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem).
+ * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem).
  * 3. No other concurrent thread can access the list (release)
  */
 struct mmu_notifier {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ccf793c..e809554 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -729,7 +729,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 
 		if (!prev && !more) {
 			/*
-			 * Needs GFP_NOWAIT to avoid i_mmap_mutex recursion through
+			 * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
 			 * reclaim. This is optimistic, no harm done if it fails.
 			 */
 			prev = kmalloc(sizeof(struct map_info),
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..aadf613 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -61,16 +61,16 @@
 /*
  * Lock ordering:
  *
- *  ->i_mmap_mutex		(truncate_pagecache)
+ *  ->i_mmap_rwsem		(truncate_pagecache)
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
  *  ->i_mutex
- *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
+ *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_sem
- *    ->i_mmap_mutex
+ *    ->i_mmap_rwsem
  *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
  *        ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
  *
@@ -84,7 +84,7 @@
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
- *  ->i_mmap_mutex
+ *  ->i_mmap_rwsem
  *    ->anon_vma.lock		(vma_adjust)
  *
  *  ->anon_vma.lock
@@ -104,7 +104,7 @@
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 161f98d..3ab3ffd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2713,9 +2713,9 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	 * on its way out.  We're lucky that the flag has such an appropriate
 	 * name, and can in fact be safely cleared here. We could clear it
 	 * before the __unmap_hugepage_range above, but all that's necessary
-	 * is to clear it before releasing the i_mmap_mutex. This works
+	 * is to clear it before releasing the i_mmap_rwsem. This works
 	 * because in the context this is called, the VMA is about to be
-	 * destroyed and the i_mmap_mutex is held.
+	 * destroyed and the i_mmap_rwsem is held.
 	 */
 	vma->vm_flags &= ~VM_MAYSHARE;
 }
@@ -3364,9 +3364,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 	}
 	/*
-	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
-	 * once we release i_mmap_mutex, another task can do the final put_page
+	 * once we release i_mmap_rwsem, another task can do the final put_page
 	 * and that page table be reused and filled with junk.
 	 */
 	flush_tlb_range(vma, start, end);
@@ -3519,7 +3519,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
  * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_mutex section - otherwise
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
  * racing tasks could either miss the sharing (see huge_pte_offset) or select a
  * bad pmd for sharing.
  */
diff --git a/mm/mremap.c b/mm/mremap.c
index 60259a2..3f4e487 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -99,7 +99,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	spinlock_t *old_ptl, *new_ptl;
 
 	/*
-	 * When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma
+	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
 	 * locks to ensure that rmap will always observe either the old or the
 	 * new ptes. This is the easiest way to avoid races with
 	 * truncate_pagecache(), page migration, etc...
diff --git a/mm/rmap.c b/mm/rmap.c
index 5841dcb..56346c9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,7 +23,7 @@
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
- *       mapping->i_mmap_mutex
+ *       mapping->i_mmap_rwsem
  *         anon_vma->rwsem
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1258,7 +1258,7 @@ out_mlock:
 	/*
 	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
 	 * unstable result and race. Plus, We can't wait here because
-	 * we now hold anon_vma->rwsem or mapping->i_mmap_mutex.
+	 * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
 	 * if trylock failed, the page remain in evictable lru and later
 	 * vmscan could retry to move the page to unevictable lru if the
 	 * page is actually mlocked.
@@ -1679,7 +1679,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 	 * The page lock not only makes sure that page->mapping cannot
 	 * suddenly be NULLified by truncation, it makes sure that the
 	 * structure at mapping cannot be freed and reused yet,
-	 * so we can safely take mapping->i_mmap_mutex.
+	 * so we can safely take mapping->i_mmap_rwsem.
 	 */
 	VM_BUG_ON(!PageLocked(page));
 
-- 
1.8.1.4



* Re: [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex
  2014-05-23  3:33 ` [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex Davidlohr Bueso
@ 2014-05-23 17:16   ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-05-23 17:16 UTC (permalink / raw)
  To: Davidlohr Bueso, akpm
  Cc: mingo, peterz, mgorman, aswin, linux-mm, linux-fsdevel, linux-kernel

On 05/22/2014 11:33 PM, Davidlohr Bueso wrote:
> Various parts of the kernel acquire and release this mutex,
> so add i_mmap_lock_write() and i_mmap_unlock_write() helper
> functions that will encapsulate this logic. The next patch
> will make use of these.
> 
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [PATCH 2/5] mm: use new helper functions around the i_mmap_mutex
  2014-05-23  3:33 ` [PATCH 2/5] mm: use new helper functions around the i_mmap_mutex Davidlohr Bueso
@ 2014-05-23 17:16   ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-05-23 17:16 UTC (permalink / raw)
  To: Davidlohr Bueso, akpm
  Cc: mingo, peterz, mgorman, aswin, linux-mm, linux-fsdevel, linux-kernel

On 05/22/2014 11:33 PM, Davidlohr Bueso wrote:
> Convert all open coded mutex_lock/unlock calls to the
> i_mmap_[lock/unlock]_write() helpers.
> 
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 3/5] mm: convert i_mmap_mutex to rwsem
  2014-05-23  3:33 ` [PATCH 3/5] mm: convert i_mmap_mutex to rwsem Davidlohr Bueso
@ 2014-05-23 17:33   ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-05-23 17:33 UTC (permalink / raw)
  To: Davidlohr Bueso, akpm
  Cc: mingo, peterz, mgorman, aswin, linux-mm, linux-fsdevel, linux-kernel

On 05/22/2014 11:33 PM, Davidlohr Bueso wrote:
> The i_mmap_mutex is a close cousin of the anon-vma lock,
> both protecting similar data, one for file-backed pages
> and the other for anon memory. To this end, this lock can
> also be an rwsem.
> 
> This conversion is straightforward. For now, all users take
> the write lock.
> 
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 4/5] mm/rmap: share the i_mmap_rwsem
  2014-05-23  3:33 ` [PATCH 4/5] mm/rmap: share the i_mmap_rwsem Davidlohr Bueso
@ 2014-05-23 18:35   ` Rik van Riel
  2014-05-26 19:35   ` Hugh Dickins
  1 sibling, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-05-23 18:35 UTC (permalink / raw)
  To: Davidlohr Bueso, akpm
  Cc: mingo, peterz, mgorman, aswin, linux-mm, linux-fsdevel, linux-kernel

On 05/22/2014 11:33 PM, Davidlohr Bueso wrote:
> Similarly to rmap_walk_anon() and collect_procs_anon(),
> there is opportunity to share the lock in rmap_walk_file()
> and collect_procs_file() for file backed pages.
> 
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 5/5] mm: rename leftover i_mmap_mutex
  2014-05-23  3:33 ` [PATCH 5/5] mm: rename leftover i_mmap_mutex Davidlohr Bueso
@ 2014-05-23 18:36   ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2014-05-23 18:36 UTC (permalink / raw)
  To: Davidlohr Bueso, akpm
  Cc: mingo, peterz, mgorman, aswin, linux-mm, linux-fsdevel, linux-kernel

On 05/22/2014 11:33 PM, Davidlohr Bueso wrote:
> Update the lock to i_mmap_rwsem throughout the kernel.
> All changes are in comments and documentation.
> 
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 4/5] mm/rmap: share the i_mmap_rwsem
  2014-05-23  3:33 ` [PATCH 4/5] mm/rmap: share the i_mmap_rwsem Davidlohr Bueso
  2014-05-23 18:35   ` Rik van Riel
@ 2014-05-26 19:35   ` Hugh Dickins
  2014-05-26 20:48     ` Davidlohr Bueso
  1 sibling, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2014-05-26 19:35 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, mingo, peterz, riel, mgorman, aswin, linux-mm,
	linux-fsdevel, linux-kernel

On Thu, 22 May 2014, Davidlohr Bueso wrote:

> Similarly to rmap_walk_anon() and collect_procs_anon(),
> there is opportunity to share the lock in rmap_walk_file()
> and collect_procs_file() for file backed pages.

And lots of other places, no?  I welcome i_mmap_rwsem, but I think
you're approaching it wrongly to separate this off from 2/5, then
follow anon_vma for the places that can be converted to lock_read().

If you go back through 2/5 and study the context of each, I think
you'll find most make no modification to the tree, and can well
use the lock_read() rather than the lock_write().
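
To illustrate the shape of such a conversion (not a claim that any
particular caller is safe without auditing its context), a path that only
walks the interval tree would simply move from the write-side to the
read-side helpers:

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		/* inspect or unmap pages; no tree insertions or removals */
	}
	i_mmap_unlock_read(mapping);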

I could be wrong, but I don't think there are any hidden gotchas.
There certainly are in the anon_vma case (where THP makes special
use of the anon_vma lock), and used to be in the i_mmap_lock case
(when invalidation had to be single-threaded across cond_rescheds),
but I think i_mmap_rwsem should be straightforward.

Sure, it's safe to use the lock_write() variant, but please don't
prefer it to lock_read() without good reason.

Hugh

> 
> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
> ---
>  include/linux/fs.h  | 10 ++++++++++
>  mm/memory-failure.c |  4 ++--
>  mm/rmap.c           |  4 ++--
>  3 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 60a1d7d..4c2c228 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -467,6 +467,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping)
>  	up_write(&mapping->i_mmap_rwsem);
>  }
>  
> +static inline void i_mmap_lock_read(struct address_space *mapping)
> +{
> +	down_read(&mapping->i_mmap_rwsem);
> +}
> +
> +static inline void i_mmap_unlock_read(struct address_space *mapping)
> +{
> +	up_read(&mapping->i_mmap_rwsem);
> +}
> +
>  /*
>   * Might pages of this file be mapped into userspace?
>   */
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1389a28..acbcd8e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -434,7 +434,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	struct task_struct *tsk;
>  	struct address_space *mapping = page->mapping;
>  
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
>  		pgoff_t pgoff = page_pgoff(page);
> @@ -456,7 +456,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  }
>  
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9a56e4f..5841dcb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1685,7 +1685,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
>  
>  	if (!mapping)
>  		return ret;
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>  		unsigned long address = vma_address(page, vma);
>  
> @@ -1708,7 +1708,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
>  	ret = rwc->file_nonlinear(page, mapping, rwc->arg);
>  
>  done:
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return ret;
>  }
>  
> -- 
> 1.8.1.4


* Re: [PATCH 4/5] mm/rmap: share the i_mmap_rwsem
  2014-05-26 19:35   ` Hugh Dickins
@ 2014-05-26 20:48     ` Davidlohr Bueso
  0 siblings, 0 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-26 20:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mingo, peterz, riel, mgorman, aswin, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, 2014-05-26 at 12:35 -0700, Hugh Dickins wrote:
> On Thu, 22 May 2014, Davidlohr Bueso wrote:
> 
> > Similarly to rmap_walk_anon() and collect_procs_anon(),
> > there is opportunity to share the lock in rmap_walk_file()
> > and collect_procs_file() for file backed pages.
> 
> And lots of other places, no?  I welcome i_mmap_rwsem, but I think
> you're approaching it wrongly to separate this off from 2/5, then
> follow anon_vma for the places that can be converted to lock_read().

Sure, but as you can imagine, the reasoning behind it is simplicity and
bisectability. In 2/5 it is easy to commit typo-like errors and end up
locking instead of unlocking, or vice versa. I ran into a few while
testing and wanted to make life easier for reviewers.

> If you go back through 2/5 and study the context of each, I think
> you'll find most make no modification to the tree, and can well
> use the lock_read() rather than the lock_write().

I was planning on revisiting some of that. I have no concrete examples
yet, but I agree, there could very well be further opportunity to share
the lock in read-only paths. This 4/5 is just the first, and most
obvious, step towards improving the usage of the i_mmap lock.

> I could be wrong, but I don't think there are any hidden gotchas.
> There certainly are in the anon_vma case (where THP makes special
> use of the anon_vma lock), and used to be in the i_mmap_lock case
> (when invalidation had to be single-threaded across cond_rescheds),
> but I think i_mmap_rwsem should be straightforward.
> 
> Sure, it's safe to use the lock_write() variant, but please don't
> prefer it to lock_read() without good reason.

I will dig deeper (probably for 3.17 now), but I really believe this is
the correct way of splitting the patches for this particular series.

Thanks,
Davidlohr



* Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem
  2014-05-23  3:33 [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2014-05-23  3:33 ` [PATCH 5/5] mm: rename leftover i_mmap_mutex Davidlohr Bueso
@ 2014-05-30  2:20 ` Davidlohr Bueso
  2014-06-02 20:08   ` Andrew Morton
  5 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2014-05-30  2:20 UTC (permalink / raw)
  To: akpm
  Cc: mingo, peterz, riel, mgorman, aswin, linux-mm, linux-fsdevel,
	linux-kernel

ping? Andrew, any chance of getting this in -next?

On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> This patchset extends the work started by Ingo Molnar in late 2012,
> optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> to an rwsem, and sharing the lock for read-only paths when walking the
> vma-interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> 
> The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> file-backed pages. Therefore we can use similar locking techniques: convert
> the mutex to an rwsem and share the lock when possible.
> 
> With the new optimistic spinning property we have in rwsems, we no longer
> take a hit in performance when using this lock, and we can therefore
> safely do the conversion. Tests show no throughput regressions in aim7 or
> pgbench runs, and we can see gains from sharing the lock, in disk workloads
> ~+15% for over 1000 users on an 8-socket Westmere system.
> 
> This patchset applies on linux-next-20140522.
> 
> Thanks!
> 
> Davidlohr Bueso (5):
>   mm,fs: introduce helpers around i_mmap_mutex
>   mm: use new helper functions around the i_mmap_mutex
>   mm: convert i_mmap_mutex to rwsem
>   mm/rmap: share the i_mmap_rwsem
>   mm: rename leftover i_mmap_mutex
> 
>  fs/hugetlbfs/inode.c         | 14 +++++++-------
>  fs/inode.c                   |  2 +-
>  include/linux/fs.h           | 23 ++++++++++++++++++++++-
>  include/linux/mmu_notifier.h |  2 +-
>  kernel/events/uprobes.c      |  6 +++---
>  kernel/fork.c                |  4 ++--
>  mm/filemap.c                 | 10 +++++-----
>  mm/filemap_xip.c             |  4 ++--
>  mm/hugetlb.c                 | 22 +++++++++++-----------
>  mm/memory-failure.c          |  4 ++--
>  mm/memory.c                  |  8 ++++----
>  mm/mmap.c                    | 22 +++++++++++-----------
>  mm/mremap.c                  |  6 +++---
>  mm/nommu.c                   | 14 +++++++-------
>  mm/rmap.c                    | 10 +++++-----
>  15 files changed, 86 insertions(+), 65 deletions(-)
> 




* Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem
  2014-05-30  2:20 ` [PATCH 0/5] mm: i_mmap_mutex to rwsem Davidlohr Bueso
@ 2014-06-02 20:08   ` Andrew Morton
  2014-06-02 20:31     ` Davidlohr Bueso
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2014-06-02 20:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, riel, mgorman, aswin, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, 29 May 2014 19:20:15 -0700 Davidlohr Bueso <davidlohr@hp.com> wrote:

> On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> > This patchset extends the work started by Ingo Molnar in late 2012,
> > optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> > to an rwsem, and sharing the lock for read-only paths when walking the
> > vma-interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> > 
> > The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> > file-backed pages. Therefore we can use similar locking techniques: convert
> > the mutex to an rwsem and share the lock when possible.
> > 
> > With the new optimistic spinning property we have in rwsems, we no longer
> > take a hit in performance when using this lock, and we can therefore
> > safely do the conversion. Tests show no throughput regressions in aim7 or
> > pgbench runs, and we can see gains from sharing the lock, in disk workloads
> > ~+15% for over 1000 users on an 8-socket Westmere system.
> > 
> > This patchset applies on linux-next-20140522.
>
> ping? Andrew any chance of getting this in -next?

(top-posting repaired)

It was a bit late for 3.16 back on May 26, when you said "I will dig
deeper (probably for 3.17 now)".  So, please take another look at the
patch factoring and let's get this underway for -rc1.



* Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem
  2014-06-02 20:08   ` Andrew Morton
@ 2014-06-02 20:31     ` Davidlohr Bueso
  2014-06-02 23:54       ` Hugh Dickins
  0 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2014-06-02 20:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, peterz, riel, mgorman, aswin, linux-mm, linux-fsdevel,
	linux-kernel

On Mon, 2014-06-02 at 13:08 -0700, Andrew Morton wrote:
> On Thu, 29 May 2014 19:20:15 -0700 Davidlohr Bueso <davidlohr@hp.com> wrote:
> 
> > On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> > > This patchset extends the work started by Ingo Molnar in late 2012,
> > > optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> > > to an rwsem, and sharing the lock for read-only paths when walking the
> > > vma-interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> > > 
> > > The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> > > file-backed pages. Therefore we can use similar locking techniques: convert
> > > the mutex to an rwsem and share the lock when possible.
> > > 
> > > With the new optimistic spinning property we have in rwsems, we no longer
> > > take a hit in performance when using this lock, and we can therefore
> > > safely do the conversion. Tests show no throughput regressions in aim7 or
> > > pgbench runs, and we can see gains from sharing the lock, in disk workloads
> > > ~+15% for over 1000 users on an 8-socket Westmere system.
> > > 
> > > This patchset applies on linux-next-20140522.
> >
> > ping? Andrew any chance of getting this in -next?
> 
> (top-posting repaired)
> 
> It was a bit late for 3.16 back on May 26, when you said "I will dig
> deeper (probably for 3.17 now)".  So, please take another look at the
> patch factoring and let's get this underway for -rc1.

Ok, so I meant that I'd dig deeper for the additional sharing
opportunities (of which I've found a few, as Hugh correctly suggested). So
those eventual patches could come later.

But I see no reason for *this* patchset to be delayed, as even if it
gets to be 3.17 material, I'd still very much want to have the same
patch factoring I have now. I think it's the correct way to handle lock
transitioning for both correctness and bisectability.

Thanks,
Davidlohr




* Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem
  2014-06-02 20:31     ` Davidlohr Bueso
@ 2014-06-02 23:54       ` Hugh Dickins
  0 siblings, 0 replies; 18+ messages in thread
From: Hugh Dickins @ 2014-06-02 23:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, mingo, peterz, riel, mgorman, aswin, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, 2 Jun 2014, Davidlohr Bueso wrote:
> On Mon, 2014-06-02 at 13:08 -0700, Andrew Morton wrote:
> > On Thu, 29 May 2014 19:20:15 -0700 Davidlohr Bueso <davidlohr@hp.com> wrote:
> > 
> > > On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> > > > This patchset extends the work started by Ingo Molnar in late 2012,
> > > > optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> > > > to an rwsem, and sharing the lock for read-only paths when walking the
> > > > vma-interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> > > > 
> > > > The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> > > > file-backed pages. Therefore we can use similar locking techniques: convert
> > > > the mutex to an rwsem and share the lock when possible.
> > > > 
> > > > With the new optimistic spinning property we have in rwsems, we no longer
> > > > take a hit in performance when using this lock, and we can therefore
> > > > safely do the conversion. Tests show no throughput regressions in aim7 or
> > > > pgbench runs, and we can see gains from sharing the lock, in disk workloads
> > > > ~+15% for over 1000 users on an 8-socket Westmere system.
> > > > 
> > > > This patchset applies on linux-next-20140522.
> > >
> > > ping? Andrew any chance of getting this in -next?
> > 
> > (top-posting repaired)
> > 
> > It was a bit late for 3.16 back on May 26, when you said "I will dig
> > deeper (probably for 3.17 now)".  So, please take another look at the
> > patch factoring and let's get this underway for -rc1.
> 
> Ok, so I meant that I'd dig deeper for the additional sharing
> opportunities (which I've found a few as Hugh correctly suggested). So
> those eventual patches could come later. 
> 
> But I see no reason for *this* patchset to be delayed, as even if it
> gets to be 3.17 material, I'd still very much want to have the same
> patch factoring I have now. I think its the correct way to handle lock
> transitioning for both correctness and bisectability.

I'd be glad to see it go into 3.16 if it works as well as advertised.
And if you're attached to your current 2/5, fine, do stick with that.
But please do a proper job on your 3/5, instead of just aping how the
anon case worked out.

Hugh


* [PATCH 0/5] mm: i_mmap_mutex to rwsem
@ 2013-06-25  0:21 Davidlohr Bueso
  0 siblings, 0 replies; 18+ messages in thread
From: Davidlohr Bueso @ 2013-06-25  0:21 UTC (permalink / raw)
  To: mingo, akpm
  Cc: walken, alex.shi, tim.c.chen, a.p.zijlstra, riel, peter,
	linux-kernel, linux-mm, Davidlohr Bueso

This patchset extends the work started by Ingo Molnar in late 2012,
optimizing the anon-vma mutex lock, converting it from an exclusive mutex
to an rwsem, and sharing the lock for read-only paths when walking the
vma-interval tree. More specifically, commits 5a505085 and 4fc3f1d6.

The i_mmap mutex has similar responsibilities to the anon-vma lock, protecting
file-backed pages. Therefore we can use similar locking techniques: convert
the mutex to an rwsem and share the lock when possible.

With these changes, and the rwsem optimizations discussed in
http://lkml.org/lkml/2013/6/16/38, we can see performance improvements.
For instance, on an 8-socket, 80-core DL980, when compared to a vanilla 3.10-rc5,
aim7 benefits in throughput, with the following workloads (beyond 500 users):

- alltests (+14.5%)
- custom (+17%)
- disk (+11%)
- high_systime (+5%)
- shared (+15%)
- short (+4%)

For lower user counts, there are no significant differences as all numbers
are within the 0-2% noise range.

Davidlohr Bueso (5):
  mm,fs: introduce helpers around i_mmap_mutex
  mm: use new helper functions around the i_mmap_mutex
  mm: convert i_mmap_mutex to rwsem
  mm/rmap: share the i_mmap_rwsem
  mm: rename leftover i_mmap_mutex

 Documentation/lockstat.txt   |  2 +-
 Documentation/vm/locking     |  2 +-
 arch/x86/mm/hugetlbpage.c    |  6 +++---
 fs/hugetlbfs/inode.c         |  4 ++--
 fs/inode.c                   |  2 +-
 include/linux/fs.h           | 22 +++++++++++++++++++++-
 include/linux/mmu_notifier.h |  2 +-
 kernel/events/uprobes.c      |  6 +++---
 kernel/fork.c                |  4 ++--
 mm/filemap.c                 | 10 +++++-----
 mm/filemap_xip.c             |  4 ++--
 mm/fremap.c                  |  4 ++--
 mm/hugetlb.c                 | 16 ++++++++--------
 mm/memory-failure.c          |  7 +++----
 mm/memory.c                  |  8 ++++----
 mm/mmap.c                    | 22 +++++++++++-----------
 mm/mremap.c                  |  6 +++---
 mm/nommu.c                   | 14 +++++++-------
 mm/rmap.c                    | 24 ++++++++++++------------
 19 files changed, 92 insertions(+), 73 deletions(-)

-- 
1.7.11.7


