linux-kernel.vger.kernel.org archive mirror
From: Andrea Arcangeli <andrea@qumranet.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <clameter@sgi.com>,
	Jack Steiner <steiner@sgi.com>, Robin Holt <holt@sgi.com>,
	Nick Piggin <npiggin@suse.de>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	kvm-devel@lists.sourceforge.net,
	Kanoj Sarcar <kanojsarcar@yahoo.com>,
	Roland Dreier <rdreier@cisco.com>,
	Steve Wise <swise@opengridcomputing.com>,
	linux-kernel@vger.kernel.org, Avi Kivity <avi@qumranet.com>,
	linux-mm@kvack.org, general@lists.openfabrics.org,
	Hugh Dickins <hugh@veritas.com>,
	akpm@linux-foundation.org, Rusty Russell <rusty@rustcorp.com.au>,
	Anthony Liguori <aliguori@us.ibm.com>,
	Chris Wright <chrisw@redhat.com>,
	Marcelo Tosatti <marcelo@kvack.org>,
	Eric Dumazet <dada1@cosmosbay.com>,
	"Paul E. McKenney" <paulmck@us.ibm.com>
Subject: [PATCH 08 of 11] anon-vma-rwsem
Date: Wed, 07 May 2008 16:35:58 +0200
Message-ID: <6b384bb988786aa78ef0.1210170958@duo.random>
In-Reply-To: <patchbomb.1210170950@duo.random>

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115136 -7200
# Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2
# Parent  58f716ad4d067afb6bdd1b5f7042e19d854aae0d
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows sleeping functions to be called from reverse map traversal, as
needed for the notifier callbacks. Traversals that previously serialized
on the spinlock can now overlap.

RCU is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the lock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using RCU for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do an rcu_read_unlock() after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
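
For illustration only, the "take a reference if the object is still live"
step can be sketched in userspace C11. This is not the kernel code:
struct obj, obj_tryget(), obj_get() and obj_put() are hypothetical
stand-ins for struct anon_vma, the atomic_inc_not_zero() call in
grab_anon_vma(), get_anon_vma() and put_anon_vma(). Neither RCU nor the
slab allocator is modelled, and a "freed" flag replaces anon_vma_free()
so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct anon_vma; only the refcount matters. */
struct obj {
	atomic_int refcount;
	bool freed;	/* set instead of really freeing the object */
};

/*
 * Userspace analogue of atomic_inc_not_zero(): take a reference only
 * if the count is currently nonzero. Returns true on success.
 */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;	/* count already hit zero: object is going away */
}

/* Unconditional get, for callers that already hold a reference. */
static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

/* Drop a reference; the last put "frees" the object. */
static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1)
		o->freed = true;
}
```

A caller for which obj_tryget() succeeds no longer depends on the
read-side critical section to keep the object alive, so it can leave
that section before sleeping (for example in down_read()) and must
later pair the reference with obj_put().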

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor changes due to up_xxx
  letting waiting tasks run first. This causes, for example, the AIM9 brk
  performance test to go down by 10-15%.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
 		return;
 
 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We hold either the mmap_sem lock or a reference on the
+	 * anon_vma. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 
 	if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * we cannot notice that anon_vma is freed while we migrate a page.
 	 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 	 * of migration. File cache pages are no problem because of page_lock()
 	 * File Caches may use write_page() or lock_page() in migration, then,
 	 * just care Anon page here.
 	 */
-	if (PageAnon(page)) {
-		rcu_read_lock();
-		rcu_locked = 1;
-	}
+	if (PageAnon(page))
+		anon_vma = grab_anon_vma(page);
 
 	/*
 	 * Corner case handling:
@@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get
 		if (!PageAnon(page) && PagePrivate(page)) {
 			/*
 			 * Go direct to try_to_free_buffers() here because
-			 * a) that's what try_to_release_page() would do anyway
-			 * b) we may be under rcu_read_lock() here, so we can't
-			 *    use GFP_KERNEL which is what try_to_release_page()
-			 *    needs to be effective.
+			 * that's what try_to_release_page() would do anyway
 			 */
 			try_to_free_buffers(page);
 		}
@@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get
 	} else if (charge)
  		mem_cgroup_end_migration(newpage);
 rcu_unlock:
-	if (rcu_locked)
-		rcu_read_unlock();
+	if (anon_vma)
+		put_anon_vma(anon_vma);
 
 unlock:
 
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -570,7 +570,7 @@ again:			remove_next = 1 + (end > next->
 	if (vma->anon_vma)
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
@@ -624,7 +624,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 	if (mapping)
 		up_write(&mapping->i_mmap_sem);
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (anon_vma) {
 			allocated = NULL;
 			locked = anon_vma;
-			spin_lock(&locked->lock);
+			down_write(&locked->sem);
 		} else {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
@@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
+			get_anon_vma(anon_vma);
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 			allocated = NULL;
@@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_unlock(&mm->page_table_lock);
 
 		if (locked)
-			spin_unlock(&locked->lock);
+			up_write(&locked->sem);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
@@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str
 {
 	BUG_ON(vma->anon_vma != next->anon_vma);
 	list_del(&next->anon_vma_node);
+	put_anon_vma(vma->anon_vma);
 }
 
 void __anon_vma_link(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 
-	if (anon_vma)
+	if (anon_vma) {
+		get_anon_vma(anon_vma);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+	}
 }
 
 void anon_vma_link(struct vm_area_struct *vma)
@@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct
 	struct anon_vma *anon_vma = vma->anon_vma;
 
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		get_anon_vma(anon_vma);
+		down_write(&anon_vma->sem);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 	}
 }
 
 void anon_vma_unlink(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	int empty;
 
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	down_write(&anon_vma->sem);
 	list_del(&vma->anon_vma_node);
-
-	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
-
-	if (empty)
-		anon_vma_free(anon_vma);
+	up_write(&anon_vma->sem);
+	put_anon_vma(anon_vma);
 }
 
 static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	init_rwsem(&anon_vma->sem);
+	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -157,9 +157,9 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *grab_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
-	return anon_vma;
+	if (!atomic_inc_not_zero(&anon_vma->refcount))
+		anon_vma = NULL;
 out:
 	rcu_read_unlock();
-	return NULL;
+	return anon_vma;
+}
+
+static struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = grab_anon_vma(page);
+
+	if (anon_vma)
+		down_read(&anon_vma->sem);
+	return anon_vma;
 }
 
 static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
-	rcu_read_unlock();
+	up_read(&anon_vma->sem);
+	put_anon_vma(anon_vma);
 }
 
 /*
