* [PATCH 0/9] Use RCU to stabilize page counts
@ 2011-08-19  7:48 ` Michel Lespinasse
  0 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

include/linux/pagemap.h describes the protocol one should use to get
references on pages from the page cache: the caller cannot know in advance
whether the reference it obtains will land on the desired page, so newly
allocated pages may see spurious elevated reference counts. Using RCU, this
effect can be bounded in time to one RCU grace period after allocation.
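
For illustration, the protocol looks roughly like this (a simplified
sketch, not code from this series; the function name and the plain
radix_tree_lookup() based re-check are placeholder stand-ins for what
find_get_page() really does):

/* Assumes <linux/pagemap.h>, <linux/radix-tree.h>, <linux/rcupdate.h>. */
static struct page *speculative_lookup(struct address_space *mapping,
				       pgoff_t offset)
{
	struct page *page;

	rcu_read_lock();
repeat:
	page = radix_tree_lookup(&mapping->page_tree, offset);
	if (page) {
		/* The count may already be zero if the page is being freed. */
		if (!page_cache_get_speculative(page))
			goto repeat;
		/* Re-check that the slot was not reused for another page. */
		if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
						       offset))) {
			page_cache_release(page);
			goto repeat;
		}
	}
	rcu_read_unlock();
	return page;
}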

For this protocol to work, every call site of get_page_unless_zero() has to
participate, and this was not previously enforced.

Patches 1-3 convert some get_page_unless_zero() call sites to use the proper
RCU protocol as described in pagemap.h.

Patches 4-5 convert some get_page_unless_zero() call sites to just call
get_page().

Patch 6 asserts that every remaining get_page_unless_zero() call site
participates in the RCU protocol. Well, not quite all of them:
__isolate_lru_page() is exempted because it holds the zone LRU lock, which
prevents the given page from being freed entirely, and a few other call sites
related to hwpoison, memory hotplug and memory failure are exempted because
I haven't been able to figure out what to do about them.

Patch 7 is a placeholder for an RCU API extension we have been discussing
with Paul McKenney. The idea is to record an initial time as an opaque cookie,
and to be able to determine later on whether an RCU grace period has elapsed
since that initial time.
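
The intended usage pattern would be along these lines (again just an
illustrative sketch; the structure, lock and function names below are
placeholders, not code from this series):

/* Assumes <linux/rcupdate.h> and <linux/spinlock.h>. */
struct tracked_object {
	spinlock_t lock;
	struct rcu_cookie alloc_cookie;
	/* ... */
};

/* At allocation time, record "now" as an opaque cookie. */
static void tracked_object_init(struct tracked_object *obj)
{
	spin_lock_init(&obj->lock);
	rcu_get_gp_cookie(&obj->alloc_cookie);
}

/* Later, make sure a full grace period has elapsed since allocation. */
static void tracked_object_settle(struct tracked_object *obj)
{
	spin_lock(&obj->lock);
	while (!rcu_gp_cookie_elapsed(&obj->alloc_cookie)) {
		/* Can't tell yet: drop the lock, wait, and re-check. */
		spin_unlock(&obj->lock);
		synchronize_rcu();
		spin_lock(&obj->lock);
	}
	spin_unlock(&obj->lock);
	/*
	 * Any speculative reference attempts that started before the
	 * object was allocated have now completed or failed, so its
	 * reference count can no longer be perturbed by them.
	 */
}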

Patch 8 adds wrapper functions to store an RCU cookie into compound pages.

Patch 9 makes use of the new RCU API, as well as the prior fixes from
patches 1-6, to ensure tail page counts are stable while we split THP pages.
This fixes a (rather theoretical, never actually observed) race condition
where THP page splitting could result in incorrect page counts if THP page
allocation and splitting both occur while another thread tries to run
get_page_unless_zero() on a single page that got re-allocated as a THP
tail page.


The patches have received only a limited amount of testing; however, I
believe patches 1-6 to be sane and I would like them to get more
exposure, perhaps as part of Andrew's -mm tree.


Besides that, this proposal also serves to sync up with Paul regarding the
RCU functionality :)


Michel Lespinasse (9):
  mm: rcu read lock for getting reference on pages in
    migration_entry_wait()
  mm: avoid calling get_page_unless_zero() when charging cgroups
  mm: rcu read lock when getting from tail to head page
  mm: use get_page in deactivate_page()
  kvm: use get_page instead of get_page_unless_zero
  mm: assert that get_page_unless_zero() callers hold the rcu lock
  rcu: rcu_get_gp_cookie() / rcu_gp_cookie_elapsed() stand-ins
  mm: add API for setting a grace period cookie on compound pages
  mm: make sure tail page counts are stable before splitting THP pages

 arch/x86/kvm/mmu.c       |    3 +--
 include/linux/mm.h       |   38 +++++++++++++++++++++++++++++++++++++-
 include/linux/mm_types.h |    6 +++++-
 include/linux/pagemap.h  |    1 +
 include/linux/rcupdate.h |   35 +++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |   33 +++++++++++++++++++++++++++++----
 mm/hwpoison-inject.c     |    2 +-
 mm/ksm.c                 |    4 ++++
 mm/memcontrol.c          |   20 ++++++++++----------
 mm/memory-failure.c      |    6 +++---
 mm/memory_hotplug.c      |    2 +-
 mm/migrate.c             |    3 +++
 mm/page_alloc.c          |    1 +
 mm/swap.c                |   22 ++++++++++++++--------
 mm/vmscan.c              |    7 ++++++-
 15 files changed, 151 insertions(+), 32 deletions(-)

-- 
1.7.3.1


* [PATCH 1/9] mm: rcu read lock for getting reference on pages in migration_entry_wait()
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

migration_entry_wait() needs to take the rcu read lock so that page counts
can be guaranteed to be stable after one rcu grace period.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/migrate.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 666e4e6..6f3b5db 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -193,6 +193,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	struct page *page;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	rcu_read_lock();
 	pte = *ptep;
 	if (!is_swap_pte(pte))
 		goto out;
@@ -212,11 +213,13 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 	 */
 	if (!get_page_unless_zero(page))
 		goto out;
+	rcu_read_unlock();
 	pte_unmap_unlock(ptep, ptl);
 	wait_on_page_locked(page);
 	put_page(page);
 	return;
 out:
+	rcu_read_unlock();
 	pte_unmap_unlock(ptep, ptl);
 }
 
-- 
1.7.3.1


* [PATCH 2/9] mm: avoid calling get_page_unless_zero() when charging cgroups
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

In mem_cgroup_move_parent(), we can avoid the get_page_unless_zero()
call by taking a page reference under protection of the zone LRU lock
in mem_cgroup_force_empty_list().

In mc_handle_present_pte(), the page count is already known to be
nonzero as there is a PTE pointing to it and the page table lock is held.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/memcontrol.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..f9439ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2649,7 +2649,7 @@ out:
 }
 
 /*
- * move charges to its parent.
+ * move charges to its parent. Caller must hold a reference on page.
  */
 
 static int mem_cgroup_move_parent(struct page *page,
@@ -2669,10 +2669,8 @@ static int mem_cgroup_move_parent(struct page *page,
 		return -EINVAL;
 
 	ret = -EBUSY;
-	if (!get_page_unless_zero(page))
-		goto out;
 	if (isolate_lru_page(page))
-		goto put;
+		goto out;
 
 	nr_pages = hpage_nr_pages(page);
 
@@ -2692,8 +2690,6 @@ static int mem_cgroup_move_parent(struct page *page,
 		compound_unlock_irqrestore(page, flags);
 put_back:
 	putback_lru_page(page);
-put:
-	put_page(page);
 out:
 	return ret;
 }
@@ -3732,11 +3728,12 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
-
 		page = lookup_cgroup_page(pc);
+		get_page(page);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 		ret = mem_cgroup_move_parent(page, pc, mem, GFP_KERNEL);
+		put_page(page);
 		if (ret == -ENOMEM)
 			break;
 
@@ -5133,9 +5130,12 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 	} else if (!move_file())
 		/* we ignore mapcount for file pages */
 		return NULL;
-	if (!get_page_unless_zero(page))
-		return NULL;
 
+	/*
+	 * The page reference count is guaranteed to be nonzero since
+	 * ptent points to that page and the page table lock is held.
+	 */
+	get_page(page);
 	return page;
 }
 
-- 
1.7.3.1


* [PATCH 3/9] mm: rcu read lock when getting from tail to head page
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

In the tail page case, put_compound_page() uses get_page_unless_zero()
to get a reference on the head page. There is a small possibility that
the compound page might get split, and the head page freed, before that
reference can be obtained.

Similarly, page_trans_compound_anon_split() needs to get a reference
on a THP page's head before it can proceed with splitting it.

In order to guarantee page count stability one rcu grace period after
allocation, as described in the page_cache_get_speculative() comment in
pagemap.h, we need to take the rcu read lock from the time we locate the
head page until we get a reference on it.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/ksm.c  |    4 ++++
 mm/swap.c |    8 +++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 9a68b0c..0eec889 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -817,10 +817,12 @@ out:
 static int page_trans_compound_anon_split(struct page *page)
 {
 	int ret = 0;
+	rcu_read_lock();
 	struct page *transhuge_head = page_trans_compound_anon(page);
 	if (transhuge_head) {
 		/* Get the reference on the head to split it. */
 		if (get_page_unless_zero(transhuge_head)) {
+			rcu_read_unlock();
 			/*
 			 * Recheck we got the reference while the head
 			 * was still anonymous.
@@ -834,10 +836,12 @@ static int page_trans_compound_anon_split(struct page *page)
 				 */
 				ret = 1;
 			put_page(transhuge_head);
+			return ret;
 		} else
 			/* Retry later if split_huge_page run from under us. */
 			ret = 1;
 	}
+	rcu_read_unlock();
 	return ret;
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 3a442f1..ac617dc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,7 +78,10 @@ static void put_compound_page(struct page *page)
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
+		struct page *page_head;
+
+		rcu_read_lock();
+		page_head = page->first_page;
 		smp_rmb();
 		/*
 		 * If PageTail is still set after smp_rmb() we can be sure
@@ -87,6 +90,8 @@ static void put_compound_page(struct page *page)
 		 */
 		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
 			unsigned long flags;
+
+			rcu_read_unlock();
 			/*
 			 * Verify that our page_head wasn't converted
 			 * to a a regular page before we got a
@@ -140,6 +145,7 @@ static void put_compound_page(struct page *page)
 			}
 		} else {
 			/* page_head is a dangling pointer */
+			rcu_read_unlock();
 			VM_BUG_ON(PageTail(page));
 			goto out_put_single;
 		}
-- 
1.7.3.1


* [PATCH 4/9] mm: use get_page in deactivate_page()
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

deactivate_page() already holds a reference to the page, so it can
use get_page() instead of get_page_unless_zero().

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/swap.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index ac617dc..11574b1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -517,6 +517,8 @@ static void drain_cpu_pagevecs(int cpu)
  */
 void deactivate_page(struct page *page)
 {
+	struct pagevec *pvec;
+
 	/*
 	 * In a workload with many unevictable page such as mprotect, unevictable
 	 * page deactivation for accelerating reclaim is pointless.
@@ -524,13 +526,11 @@ void deactivate_page(struct page *page)
 	if (PageUnevictable(page))
 		return;
 
-	if (likely(get_page_unless_zero(page))) {
-		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
-
-		if (!pagevec_add(pvec, page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
-		put_cpu_var(lru_deactivate_pvecs);
-	}
+	get_page(page);
+	pvec = &get_cpu_var(lru_deactivate_pvecs);
+	if (!pagevec_add(pvec, page))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+	put_cpu_var(lru_deactivate_pvecs);
 }
 
 void lru_add_drain(void)
-- 
1.7.3.1


* [PATCH 5/9] kvm: use get_page instead of get_page_unless_zero
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

In transparent_hugepage_adjust(), we can use get_page() instead of
get_page_unless_zero() followed by a BUG() assertion that the count was
not zero.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/kvm/mmu.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aee3862..d9b7f0c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2353,8 +2353,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 			*gfnp = gfn;
 			kvm_release_pfn_clean(pfn);
 			pfn &= ~mask;
-			if (!get_page_unless_zero(pfn_to_page(pfn)))
-				BUG();
+			get_page(pfn_to_page(pfn));
 			*pfnp = pfn;
 		}
 	}
-- 
1.7.3.1


* [PATCH 6/9] mm: assert that get_page_unless_zero() callers hold the rcu lock
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

In order to guarantee that page counts are stable one rcu grace period
after page allocation, it is important that get_page_unless_zero()
call sites follow the proper protocol and hold the rcu read lock from
the time they locate the desired page until they get a reference on it.

__isolate_lru_page() is exempted - it knows the page it's trying to get
a reference on can't get fully freed, as the page is on the LRU list and
the caller holds the zone LRU lock.

Other call sites in memory_hotplug.c, memory-failure.c and hwpoison-inject.c
are also exempted. It would be preferable if someone more familiar with
these features could determine whether that's safe.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h      |   16 +++++++++++++++-
 include/linux/pagemap.h |    1 +
 mm/hwpoison-inject.c    |    2 +-
 mm/memory-failure.c     |    6 +++---
 mm/memory_hotplug.c     |    2 +-
 mm/vmscan.c             |    7 ++++++-
 6 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..9ff5f2d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -275,13 +275,27 @@ static inline int put_page_testzero(struct page *page)
 	return atomic_dec_and_test(&page->_count);
 }
 
+static inline int __get_page_unless_zero(struct page *page)
+{
+	return atomic_inc_not_zero(&page->_count);
+}
+
 /*
  * Try to grab a ref unless the page has a refcount of zero, return false if
  * that is the case.
  */
 static inline int get_page_unless_zero(struct page *page)
 {
-	return atomic_inc_not_zero(&page->_count);
+	/*
+	 * See page_cache_get_speculative() comment in pagemap.h
+	 * Note that for page counts to be guaranteed stable one
+	 * RCU grace period after they've been allocated,
+	 * all get_page_unless_zero call sites have to participate
+	 * by taking an rcu read lock before locating the desired page
+	 * and until getting a reference on it.
+	 */
+	VM_BUG_ON(!rcu_read_lock_held());
+	return __get_page_unless_zero(page);
 }
 
 extern int page_is_ram(unsigned long pfn);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716875e..736f47d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -131,6 +131,7 @@ void release_pages(struct page **pages, int nr, int cold);
  */
 static inline int page_cache_get_speculative(struct page *page)
 {
+	VM_BUG_ON(!rcu_read_lock_held());
 	VM_BUG_ON(in_interrupt());
 
 #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU)
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index c7fc7fd..87e027b 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -30,7 +30,7 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * This implies unable to support free buddy pages.
 	 */
-	if (!get_page_unless_zero(hpage))
+	if (!__get_page_unless_zero(hpage))
 		return 0;
 
 	if (!PageLRU(p) && !PageHuge(p))
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 740c4f5..6fc0409 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1025,7 +1025,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
 	if (!(flags & MF_COUNT_INCREASED) &&
-		!get_page_unless_zero(hpage)) {
+		!__get_page_unless_zero(hpage)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, "free buddy", DELAYED);
 			return 0;
@@ -1210,7 +1210,7 @@ int unpoison_memory(unsigned long pfn)
 
 	nr_pages = 1 << compound_trans_order(page);
 
-	if (!get_page_unless_zero(page)) {
+	if (!__get_page_unless_zero(page)) {
 		/*
 		 * Since HWPoisoned hugepage should have non-zero refcount,
 		 * race between memory failure and unpoison seems to happen.
@@ -1289,7 +1289,7 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
 	 */
-	if (!get_page_unless_zero(compound_head(p))) {
+	if (!__get_page_unless_zero(compound_head(p))) {
 		if (PageHuge(p)) {
 			pr_info("get_any_page: %#lx free huge page\n", pfn);
 			ret = dequeue_hwpoisoned_huge_page(compound_head(p));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c46887b..cf57dfc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -710,7 +710,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
-		if (!get_page_unless_zero(page))
+		if (!__get_page_unless_zero(page))
 			continue;
 		/*
 		 * We can skip free pages. And we can only deal with pages on
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d036e59..4c167da 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1001,11 +1001,16 @@ int __isolate_lru_page(struct page *page, int mode, int file)
 
 	ret = -EBUSY;
 
-	if (likely(get_page_unless_zero(page))) {
+	if (likely(__get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
 		 * sure the page is not being freed elsewhere -- the
 		 * page release code relies on it.
+		 *
+		 * We are able to use the __get_page_unless_zero() variant
+		 * because we know the page can't get fully freed before we
+		 * get the reference on it - as it is on LRU list and we
+		 * hold the zone LRU lock.
 		 */
 		ClearPageLRU(page);
 		ret = 0;
-- 
1.7.3.1


* [PATCH 7/9] rcu: rcu_get_gp_cookie() / rcu_gp_cookie_elapsed() stand-ins
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

Prototypes for the proposed rcu_get_gp_cookie() / rcu_gp_cookie_elapsed()
functionality as discussed in http://marc.info/?l=linux-mm&m=131316547914194

(This is not a correct implementation of the proposed API;
Paul McKenney is to provide that as a follow-up to this RFC)

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/rcupdate.h |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 99f9aa7..315a0f1 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -865,4 +865,39 @@ static inline void __rcu_reclaim(struct rcu_head *head)
 #define kfree_rcu(ptr, rcu_head)					\
 	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
 
+struct rcu_cookie {long x; };
+
+/*
+ * rcu_get_gp_cookie() / rcu_gp_cookie_elapsed()
+ *
+ * rcu_get_gp_cookie() stores an opaque cookie to the provided location.
+ *
+ * rcu_gp_cookie_elapsed() returns true if it can guarantee that
+ * an rcu grace period has elapsed since the provided cookie was
+ * created by rcu_get_gp_cookie(). A return value of false indicates
+ * that rcu_gp_cookie_elapsed() does not know if an rcu grace period has
+ * elapsed or not - the call site is then expected to drop locks as desired,
+ * call synchronize_rcu(), and retry.
+ *
+ * Note that call sites are allowed to assume that rcu_gp_cookie_elapsed()
+ * will return true if they try enough times. An implementation that always
+ * returns false would be incorrect.
+ *
+ * The implementation below is also incorrect (may return false positives),
+ * however it does test that one always calls rcu_get_gp_cookie() before
+ * rcu_gp_cookie_elapsed() and that rcu_gp_cookie_elapsed() call sites
+ * are ready to handle both possible cases.
+ */
+static inline void rcu_get_gp_cookie(struct rcu_cookie *rcp)
+{
+	rcp->x = 12345678;
+}
+
+static inline bool rcu_gp_cookie_elapsed(struct rcu_cookie *rcp)
+{
+	static int count = 0;
+	BUG_ON(rcp->x != 12345678);
+	return (count++ % 16) != 0;
+}
+
 #endif /* __LINUX_RCUPDATE_H */
-- 
1.7.3.1


* [PATCH 8/9] mm: add API for setting a grace period cookie on compound pages
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

This commit adds the page_get_gp_cookie() / page_gp_cookie_elapsed()
API to be used on compound pages. page_get_gp_cookie() sets a cookie
on a page and page_gp_cookie_elapsed() returns true if an rcu grace
period has elapsed since the cookie was set.

page_clear_gp_cookie() is called before freeing compound pages so that
their state is always returned to the expected standard (as enforced by
free_pages_check() in mm/page_alloc.c).

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h       |   22 ++++++++++++++++++++++
 include/linux/mm_types.h |    6 +++++-
 mm/page_alloc.c          |    1 +
 3 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9ff5f2d..2649b59 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -494,6 +494,28 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].lru.prev = (void *)order;
 }
 
+static inline void page_get_gp_cookie(struct page *page)
+{
+	VM_BUG_ON(!PageHead(page));
+	rcu_get_gp_cookie(&page[1].thp_create_timestamp);
+}
+
+static inline bool page_gp_cookie_elapsed(struct page *page)
+{
+	VM_BUG_ON(!PageHead(page));
+	return rcu_gp_cookie_elapsed(&page[1].thp_create_timestamp);
+}
+
+static inline void page_clear_gp_cookie(struct page *page)
+{
+	VM_BUG_ON(!PageHead(page));
+	VM_BUG_ON(offsetof(struct page, thp_create_timestamp) !=
+		  offsetof(struct page, mapping));
+	VM_BUG_ON(sizeof(page->thp_create_timestamp) !=
+		  sizeof(page->mapping));
+	page[1].mapping = 0;
+}
+
 #ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 027935c..a6f99aa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/rcupdate.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -66,7 +67,10 @@ struct page {
 	    spinlock_t ptl;
 #endif
 	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
-	    struct page *first_page;	/* Compound tail pages */
+	    struct {	/* Compound tail pages */
+		struct page *first_page;
+		struct rcu_cookie thp_create_timestamp;
+	    };
 	};
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e8985a..dc42355 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -342,6 +342,7 @@ out:
 
 static void free_compound_page(struct page *page)
 {
+	page_clear_gp_cookie(page);
 	__free_pages_ok(page, compound_order(page));
 }
 
-- 
1.7.3.1


* [PATCH 9/9] mm: make sure tail page counts are stable before splitting THP pages
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:48   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:48 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

As described in the page_cache_get_speculative() comment
in pagemap.h, the count of all pages coming out of the allocator
must be considered unstable unless an RCU grace period has passed
since the pages were allocated.

This is an issue for THP because __split_huge_page_refcount()
depends on tail page counts being stable.

By setting a cookie on THP pages when they are allocated, we are able
to ensure the tail page counts are stable before splitting such pages.
In the typical case, the THP page should be old enough by the time we
try to split it, so that we won't have to wait.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/huge_memory.c |   33 +++++++++++++++++++++++++++++----
 1 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81532f2..46c0c0b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -657,15 +657,23 @@ static inline struct page *alloc_hugepage_vma(int defrag,
 					      unsigned long haddr, int nd,
 					      gfp_t extra_gfp)
 {
-	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
+	struct page *page;
+	page = alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
 			       HPAGE_PMD_ORDER, vma, haddr, nd);
+	if (page)
+		page_get_gp_cookie(page);
+	return page;
 }
 
 #ifndef CONFIG_NUMA
 static inline struct page *alloc_hugepage(int defrag)
 {
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
+	struct page *page;
+	page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
 			   HPAGE_PMD_ORDER);
+	if (page)
+		page_get_gp_cookie(page);
+	return page;
 }
 #endif
 
@@ -1209,7 +1217,7 @@ static void __split_huge_page_refcount(struct page *page)
 		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
-		BUG_ON(page_tail->mapping);
+		BUG_ON(page_tail->mapping);  /* see page_clear_gp_cookie() */
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = ++head_index;
@@ -1387,9 +1395,11 @@ static void __split_huge_page(struct page *page,
 int split_huge_page(struct page *page)
 {
 	struct anon_vma *anon_vma;
-	int ret = 1;
+	int ret;
 
+retry:
 	BUG_ON(!PageAnon(page));
+	ret = 1;
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
 		goto out;
@@ -1397,6 +1407,21 @@ int split_huge_page(struct page *page)
 	if (!PageCompound(page))
 		goto out_unlock;
 
+	/*
+	 * Make sure the tail page counts are stable before splitting the page.
+	 * See the page_cache_get_speculative() comment in pagemap.h.
+	 */
+	if (!page_gp_cookie_elapsed(page)) {
+		page_unlock_anon_vma(anon_vma);
+		synchronize_rcu();
+		goto retry;
+	}
+
+	/*
+	 * Make sure page_tail->mapping is cleared before we split up the page.
+	 */
+	page_clear_gp_cookie(page);
+
 	BUG_ON(!PageSwapBacked(page));
 	__split_huge_page(page, anon_vma);
 	count_vm_event(THP_SPLIT);
-- 
1.7.3.1


* Re: [PATCH 0/9] Use RCU to stabilize page counts
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-19  7:53   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-19  7:53 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Paul E. McKenney
  Cc: Andrea Arcangeli, Hugh Dickins, Minchan Kim, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li

Adding Paul - I meant to have him in the original email, but git
send-email filtered him out because I forgot to add <> around his
email. DOH!

On Fri, Aug 19, 2011 at 12:48 AM, Michel Lespinasse <walken@google.com> wrote:
> include/linux/pagemap.h describes the protocol one should use to get pages
> from page cache - one can't know if the reference they get will be on the
> desired page, so newly allocated pages might see elevated reference counts,
> but using RCU this effect can be limited in time to one RCU grace period.
>
> For this protocol to work, every call site of get_page_unless_zero() has to
> participate, and this was not previously enforced.
>
> Patches 1-3 convert some get_page_unless_zero() call sites to use the proper
> RCU protocol as described in pagemap.h
>
> Patches 4-5 convert some get_page_unless_zero() call sites to just call
> get_page()
>
> Patch 6 asserts that every remaining get_page_unless_zero() call site should
> participate in the RCU protocol. Well, not actually all of them -
> __isolate_rcu_page() is exempted because it holds the zone LRU lock which
> would prevent the given page from getting entirely freed, and a few others
> related to hwpoison, memory hotplug and memory failure are exempted because
> I haven't been able to figure out what to do.
>
> Patch 7 is a placeholder for an RCU API extension we have been talking about
> with Paul McKenney. The idea is to record an initial time as an opaque cookie,
> and to be able to determine later on if an rcu grace period has elapsed since
> that initial time.
>
> Patch 8 adds wrapper functions to store an RCU cookie into compound pages.
>
> Patch 9 makes use of new RCU API, as well as the prior fixes from patches 1-6,
> to ensure tail page counts are stable while we split THP pages. This fixes a
> (rather theorical, not actually been observed) race condition where THP page
> splitting could result in incorrect page counts if THP page allocation and
> splitting both occur while another thread tries to run get_page_unless_zero
> on a single page that got re-allocated as THP tail page.
>
>
> The patches have received only a limited amount of testing; however I
> believe patches 1-6 to be sane and I would like them to get more
> exposure, maybe as part of andrew's -mm tree.
>
>
> Besides that, this proposal is also to sync up with Paul regarding the RCU
> functionality :)
>
>
> Michel Lespinasse (9):
>  mm: rcu read lock for getting reference on pages in
>    migration_entry_wait()
>  mm: avoid calling get_page_unless_zero() when charging cgroups
>  mm: rcu read lock when getting from tail to head page
>  mm: use get_page in deactivate_page()
>  kvm: use get_page instead of get_page_unless_zero
>  mm: assert that get_page_unless_zero() callers hold the rcu lock
>  rcu: rcu_get_gp_cookie() / rcu_gp_cookie_elapsed() stand-ins
>  mm: add API for setting a grace period cookie on compound pages
>  mm: make sure tail page counts are stable before splitting THP pages
>
>  arch/x86/kvm/mmu.c       |    3 +--
>  include/linux/mm.h       |   38 +++++++++++++++++++++++++++++++++++++-
>  include/linux/mm_types.h |    6 +++++-
>  include/linux/pagemap.h  |    1 +
>  include/linux/rcupdate.h |   35 +++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c         |   33 +++++++++++++++++++++++++++++----
>  mm/hwpoison-inject.c     |    2 +-
>  mm/ksm.c                 |    4 ++++
>  mm/memcontrol.c          |   20 ++++++++++----------
>  mm/memory-failure.c      |    6 +++---
>  mm/memory_hotplug.c      |    2 +-
>  mm/migrate.c             |    3 +++
>  mm/page_alloc.c          |    1 +
>  mm/swap.c                |   22 ++++++++++++++--------
>  mm/vmscan.c              |    7 ++++++-
>  15 files changed, 151 insertions(+), 32 deletions(-)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH 6/9] mm: assert that get_page_unless_zero() callers hold the rcu lock
  2011-08-19  7:48   ` Michel Lespinasse
@ 2011-08-19 23:28     ` Andi Kleen
  -1 siblings, 0 replies; 109+ messages in thread
From: Andi Kleen @ 2011-08-19 23:28 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Hugh Dickins, Minchan Kim, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li

Michel Lespinasse <walken@google.com> writes:
>
> Other call sites in memory_hotplug.c, memory_failure.c and hwpoison-inject.c
> are also exempted. It would be preferable if someone more familiar with

I see no reason why hwpoison-inject needs to be exempted. If it doesn't
hold the RCU read lock, it should.
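
Concretely, that would just mean wrapping the speculative reference,
roughly like the sketch below (illustrative only, not the actual
hwpoison-inject code; the helper name is made up):

static int hwpoison_get_page_sketch(struct page *p)
{
	int got;

	rcu_read_lock();
	got = get_page_unless_zero(p);	/* now satisfies the patch 6 assertion */
	rcu_read_unlock();
	return got;
}
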
-Andi

 

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix
  2011-08-19  7:48 ` Michel Lespinasse
@ 2011-08-22 21:33   ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-22 21:33 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li

Hi Michel,

I had some proper time today to think about this issue, and by focusing
more on what the problem really is, I think I found a simpler way to fix
it. I also found another, maybe even smaller, race in direct-io which I
hope this fixes too.

Fixing this was already among my top priorities, but before KVMForum I
wanted to obtain proof that knumad driving the scheduler was working as
well as hard NUMA bindings.

So this solution:

1) it should allow the working set estimation code to keep doing its
get_page_unless_zero() without any change (you'll still have to modify
it to check whether you got a THP page etc... but you won't risk getting
any tail page anymore). Maybe it still needs some non-trivial thought
about the changes, but no longer about tail page refcounting screwups.

2) no change to any existing get_page_unless_zero() call site is
required, so this should fix the radix tree speculative page lookup too.

3) no new RCU feature is needed

4) get_page was actually being called by direct-io, as the debug
instrumentation I wrote to test these changes noticed, so I fixed
that too

3.1.0-rc crashes at boot for me; I think it's broken and doesn't boot
unless one has an initrd, which I never have, so I did all testing on
3.0.0 and the patch is against that too.

I'd like it if you could review it. It's still a bit too early to be
sure it works, but my torture testing is going on without many problems
so far (a loop of dd if=/dev/zero of=/dev/null bs=10M iflag=direct,
plus heavy swapping, THP splitting in a loop, and KVM).

===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel, while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that use get_page_unless_zero() on SMP, if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).
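
For context, get_page_unless_zero() is essentially the helper below
(include/linux/mm.h), which is why pinning tail_page->_count at zero
makes the speculative grab fail unconditionally on tail pages:

static inline int get_page_unless_zero(struct page *page)
{
	/*
	 * atomic_inc_not_zero() only takes the reference when _count is
	 * non-zero; with tail pages kept at _count == 0 it can never
	 * succeed on a tail page.
	 */
	return atomic_inc_not_zero(&page->_count);
}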

While debugging this s/_count/_mapcount/ change I also noticed that get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc operations in get_page weren't atomic. By
contrast, other get_user_pages users, like the secondary-MMU page fault path
that establishes the shadow pagetables, never call any superfluous get_page
after get_user_pages returns. It's safer to make get_page universally safe for
tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages
without taking any locks because it runs within PT lock protected critical
sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard
get_page() as invoked by direct-io will instead now take the compound_lock, but
still only for tail pages. The direct-io paths are usually I/O bound and the
compound_lock is per THP so very fine-grained, so there's no risk of
scalability issues with it. A simple direct-io benchmark with all lockdep
prove-locking and spinlock debugging infrastructure enabled shows identical
performance and no overhead, so it's worth it. Ideally direct-io should stop
calling get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds, and doing get_page() on
tail pages returned by GUP is generally a rare operation usually only run in
I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
critical sections will also allow the working set estimation code to work
without any further complexity associated to the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -355,38 +355,80 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, like _count, starts from -1:
+ * so that transitions both from it and to it can be tracked,
+ * using atomic_inc_and_test and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
-static inline void get_page(struct page *page)
+static inline void __get_page_tail_foll(struct page *page)
 {
 	/*
-	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
-	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+extern int __get_page_tail(struct page *page);
+
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
 		/*
 		 * This is safe only because
 		 * __split_huge_page_refcount can't run under
-		 * get_page().
+		 * get_page_foll() because we hold the proper PT lock.
 		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
+		__get_page_tail_foll(page);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
 	}
 }
 
+static inline void get_page(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		if (__get_page_tail(page))
+			return;
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
+	atomic_inc(&page->_count);
+}
+
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
@@ -803,21 +845,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1164,11 +1164,13 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		atomic_sub(page_mapcount(page_tail), &page->_count);
+		BUG_ON(atomic_read(&page->_count) <= 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
 		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		atomic_add(page_mapcount(page_tail), &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1206,7 +1208,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1514,7 +1514,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,10 @@ static void put_compound_page(struct pag
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +161,32 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *head_page = compound_trans_head(page);
+	if (likely(page != head_page)) {
+		flags = compound_lock_irqsave(head_page);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page);
+			got = 1;
+		}
+		compound_unlock_irqrestore(head_page, flags);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix
  2011-08-22 21:33   ` Andrea Arcangeli
@ 2011-08-23 14:55     ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-23 14:55 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Chris Mason

Hello everyone,

[ Chris CC'ed ]

Chris, could you run this patch on one of your hugetlbfs benchmarks to
verify there is no scalability issue with the spinlock in
__get_page_tail in your environment?

With this patch O_DIRECT on hugetlbfs will take the compound_lock even
though it doesn't need to for hugetlbfs. We could have made this special
to THP only, and in fact we could avoid the "tail_page" reference
counting completely for tail pages with hugetlbfs, but we haven't done
that so far because it's safer to have a single code path for all
compound pages rather than make hugetlbfs a special case there: all
compound pages behave the same regardless of whether it's THP,
hugetlbfs, or some driver allocating/freeing them. That keeps things
much simpler, and the overhead AFAIK is unmeasurable (I'm asking for a
benchmark just to be 100% sure). The refcounting path is tricky and the
more testing the better, so having it the same for all compound pages
is best in my view (at least for the mid term).

Also note, even if we were to ultra-optimize hugetlbfs, the SMP locking
scalability would remain identical, because the atomic_inc would still
be needed on the "head" page, which is the only "shared" item. The
"superfluous" work for hugetlbfs is _only_ the locked op on
head_page->flags (the compound_lock) and the tail_page->_mapcount
atomic inc/dec. We can't possibly eliminate the head_page->_count
atomic inc/dec even if we were to ultra-optimize for hugetlbfs, and
that's the only possibly troublesome bit in terms of SMP scalability
(the tail_page accounting is so fine-grained it can't be a scalability
issue). This is why I rule out that these changes can be measurable;
they should work as well as before.

Also, it'd be nice if somebody would look into direct-io to stop doing
that get_page there and rely only on the reference taken by
get_user_pages (KVM and all other get_user_pages users never take
additional references on pages returned by get_user_pages and rely
exclusively on the refcount taken by get_user_pages). get_user_pages
is capable of taking the reference on the tail pages perfectly safely,
without having to take the compound lock, through get_page_foll()
(which runs while the page_table_lock is still held, so it
automatically serializes with split_huge_page through the
page_table_lock). Only an additional get_page(page[i]) done on tail
pages obtained with get_user_pages requires the compound_lock to be
safe vs split_huge_page (and normally there is no way to ever call
get_page on any tail page; only after get_user_pages could you run into
that, so it can usually be easily avoided, and then the compound_lock
can be optimized away by not calling get_page on tail pages in the
first place, which also avoids all the other locked ops of get_page,
not just the compound_lock!).
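
Schematically, a GUP user relying only on the get_user_pages()
reference would look like the sketch below (hypothetical code, not the
actual direct-io path; the 3.0-era get_user_pages() arguments are shown
from memory):

static void gup_ref_only_sketch(unsigned long addr, int npages,
				struct page **pages)
{
	int i, got;

	down_read(&current->mm->mmap_sem);
	got = get_user_pages(current, current->mm, addr, npages,
			     1 /* write */, 0 /* force */, pages, NULL);
	up_read(&current->mm->mmap_sem);

	for (i = 0; i < got; i++) {
		/*
		 * Do the I/O with pages[i] here; no extra get_page() is
		 * needed, the reference taken by get_user_pages() is
		 * enough, so no compound_lock is ever taken on tail pages.
		 */
		put_page(pages[i]);	/* drop the GUP reference when done */
	}
}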

Also note, the compound_lock was already taken for the put_page of the
tail pages; we could only have avoided it in get_user_pages, so this
isn't making much of a difference. I still kept my priority of avoiding
any compound_lock for head pages: head compound pages, or regular
(non-compound) pages, still take just one atomic op like always. That
is way more important for performance, I think, because head page
refcounting and regular page refcounting are in the real CPU-bound fast
paths (not the more I/O-bound paths dealing with PCI devices like
O_DIRECT). And I still avoid the compound_lock for the secondary MMU
page fault in get_user_pages (it can't be avoided in the put_page run
after the spte is established; there's just no way around that, but
it's always been like that).

So I loaded this patch on all my systems; so far so good, and torture
testing is also going on without problems.

A git tree including the patch is here.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary

This is the actual patch you can apply to stock 3.0.0 (3.1 won't boot
here for me because I never use initrd on my development systems...).

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=41dc8190934cea22b8c8b3f89e82052610664fbb

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix
  2011-08-22 21:33   ` Andrea Arcangeli
@ 2011-08-23 16:45     ` Minchan Kim
  -1 siblings, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2011-08-23 16:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michel Lespinasse, Andrew Morton, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Mon, Aug 22, 2011 at 11:33:47PM +0200, Andrea Arcangeli wrote:
> Hi Michel,
> 
> I had some proper time today to think about this issue, and by focusing
> more on what the problem really is, I think I found a simpler way to fix
> it. I also found another, maybe even smaller, race in direct-io which I
> hope this fixes too.
> 
> Fixing this was already among my top priorities, but before KVMForum I
> wanted to obtain proof that knumad driving the scheduler was working as
> well as hard NUMA bindings.
> 
> So this solution:
> 
> 1) it should allow the working set estimation code to keep doing its
> get_page_unless_zero() without any change (you'll still have to modify
> it to check whether you got a THP page etc... but you won't risk getting
> any tail page anymore). Maybe it still needs some non-trivial thought
> about the changes, but no longer about tail page refcounting screwups.
> 
> 2) no change to any existing get_page_unless_zero() call site is
> required, so this should fix the radix tree speculative page lookup too.
> 
> 3) no new RCU feature is needed

Nice goal.

> 
> 4) get_page was actually being called by direct-io, as the debug
> instrumentation I wrote to test these changes noticed, so I fixed
> that too

Nice catch.

> 
> 3.1.0-rc crashes at boot for me; I think it's broken and doesn't boot
> unless one has an initrd, which I never have, so I did all testing on
> 3.0.0 and the patch is against that too.
> 
> I'd like it if you could review it. It's still a bit too early to be
> sure it works, but my torture testing is going on without many problems
> so far (a loop of dd if=/dev/zero of=/dev/null bs=10M iflag=direct,
> plus heavy swapping, THP splitting in a loop, and KVM).
> 
> ===
> Subject: thp: tail page refcounting fix
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Michel, while working on the working set estimation code, noticed that calling
> get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the
> pfn ended up being a tail page of a transparent hugepage under splitting by
> __split_huge_page_refcount(). He then found the problem could also
> theoretically materialize with page_cache_get_speculative() during the
> speculative radix tree lookups that use get_page_unless_zero() on SMP, if the
> radix tree page is freed and reallocated and get_user_pages is called on it
> before page_cache_get_speculative has a chance to call get_page_unless_zero().
> 
> So the best way to fix the problem is to keep page_tail->_count zero at all
> times. This will guarantee that get_page_unless_zero() can never succeed on any
> tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
> pages of a compound page, so we can simply account the tail page references
> there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
> addition to the head_page->_mapcount).

Nice idea!

> 
> While debugging this s/_count/_mapcount/ change I also noticed that get_page is
> called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
> safe because the two atomic_inc operations in get_page weren't atomic. By
> contrast, other get_user_pages users, like the secondary-MMU page fault path
> that establishes the shadow pagetables, never call any superfluous get_page
> after get_user_pages returns. It's safer to make get_page universally safe for
> tail pages and to use get_page_foll() within follow_page (inside
> get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages
> without taking any locks because it runs within PT lock protected critical
> sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard
> get_page() as invoked by direct-io will instead now take the compound_lock, but
> still only for tail pages. The direct-io paths are usually I/O bound and the
> compound_lock is per THP so very fine-grained, so there's no risk of
> scalability issues with it. A simple direct-io benchmark with all lockdep
> prove-locking and spinlock debugging infrastructure enabled shows identical
> performance and no overhead, so it's worth it. Ideally direct-io should stop
> calling get_page() on pages returned by get_user_pages(). The spinlock in
> get_page() is already optimized away for no-THP builds, and doing get_page() on
> tail pages returned by GUP is generally a rare operation usually only run in
> I/O paths.
> 
> This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
> critical sections will also allow the working set estimation code to work
> without any further complexity associated to the tail page refcounting
> with THP.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

The code looks good to me.

My only nitpick is about the name 'foll'.
What does it mean? 'follow'?
If so, I hope we use the full name.
Regardless of the renaming, I am okay with the patch.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix
  2011-08-23 16:45     ` Minchan Kim
@ 2011-08-23 16:54       ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-23 16:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michel Lespinasse, Andrew Morton, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Wed, Aug 24, 2011 at 01:45:15AM +0900, Minchan Kim wrote:
> Nice idea!

Thanks! It felt natural to account the tail refcounts in
page_tail->_count, so they were already there and it was enough to add
the page_mapcount(head_page) to the page_tail->_count. But there's no
particular reason we had to do the tail_page refcounting in the
->_count field before the split...

> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> The code looks good to me.

Thanks a lot for the quick review.

> My nitpick is about the name 'foll'.
> What does it mean? 'follow'?
> If so, I hope we can use the full name.
> Regardless of the renaming, I am okay with the patch.

Ok, the name comes from FOLL_GET. Only code paths marked by checking
FOLL_GET are allowed to call get_page_foll(); nothing else may.

mm/*memory.c:

	 if (flags & FOLL_GET)
	    get_page_foll(page);

^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix
  2011-08-22 21:33   ` Andrea Arcangeli
@ 2011-08-23 19:52     ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-23 19:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

Hi Andrea,

On Mon, Aug 22, 2011 at 2:33 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> So this solution:
>
> 1) should allow the working set estimation code to keep doing its
> get_page_unless_zero() without any change (you'll still have to modify
> it to check if you got a THP page etc... but you won't risk to get any
> tail page anymore). Maybe it still needs some non trivial thought
> about the changes but not anymore about tail pages refcounting screwups.
>
> 2) no change to all existing get_page_unless_zero() is required, so
> this should fix the radix tree speculative page lookup too.
>
> 3) no RCU new feature is needed

Adding Paul McKenney so he won't spend too much time on RCU cookie
feature until there is a firmer user...

> 4) get_page was actually called by direct-io as my debug
> instrumentation I wrote to test these changes noticed it so I fixed
> that too

Looks like this scheme will work. I'm off in Yosemite for a few days
with my family, but I should be able to review this more thoroughly on
Thursday.

From a few-minutes look, I have a few minor concerns:
- When splitting THP pages, the old tail refcount will be visible as
the _mapcount for a short while after PageTail is cleared; not clear
yet to me if there are unintended side effects to that;
- (not a concern, but an opportunity) when splitting pages, there are
two atomic adds to the tail _count field, while we know the initial
value is 0. Why not just one straight assignment ? Similarly, the
adjustments to page head count could be added into a local variable
and the page head count could be updated once after all tail pages
have been split off.
- Not sure if we could/should add assertions to make sure people call
the right get_page variant.

The other question I have is about the use of the pagemap.h RCU
protocol for eventual page count stability. With your proposal, this
would now affect only head pages, so THP splitting is fine :) . I'm
not sure who else might use that protocol, but it looks like we should
either make all get_page_unless_zero call sites follow it (if the
protocol matters to someone) or none (if the protocol turns out to be
obsolete).

Sorry for the incomplete reply, I'll have a better one by Thursday :)

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix
  2011-08-23 19:52     ` Michel Lespinasse
@ 2011-08-24  0:09       ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-24  0:09 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

Hi Michel,

On Tue, Aug 23, 2011 at 12:52:56PM -0700, Michel Lespinasse wrote:
> Adding Paul McKenney so he won't spend too much time on RCU cookie
> feature until there is a firmer user...

Yep, he already knew because I notified him privately for the same
reason.

> Looks like this scheme will work. I'm off in Yosemite for a few days
> with my family, but I should be able to review this more thoroughly on
> Thursday.

Take your time, and enjoy Yosemite :).

> From a few-minutes look, I have a few minor concerns:
> - When splitting THP pages, the old tail refcount will be visible as
> the _mapcount for a short while after PageTail is cleared; not clear
> yet to me if there are unintended side effects to that;

Well, it was zero before and that was also wrong; it is overwritten
later with the right value well after PageTail is cleared, so it's ok
if the previous code was ok. All ptes are marked pmd_trans_splitting, so
nothing should mess with page_tail->_mapcount, as no mapping can be
created or go away for the duration of the split; and regardless, any
mapping that exists only exists for the pmd and the head page (tail
pages are invisible to rmap until later).

> - (not a concern, but an opportunity) when splitting pages, there are
> two atomic adds to the tail _count field, while we know the initial
> value is 0. Why not just one straight assignment ? Similarly, the
> adjustments to page head count could be added into a local variable
> and the page head count could be updated once after all tail pages
> have been split off.

That's an optimization I can look into agreed. I guess I just added
one line and not even think too much at optimizing this,
split_huge_page isn't in a fast path.

> - Not sure if we could/should add assertions to make sure people call
> the right get_page variant.

Not right now, or it'd flood whenever anybody uses O_DIRECT. If O_DIRECT
gets fixed to stop doing this, it definitely sounds like a good idea.

I already tried adding a printk to the got=1 path and it floods with a
128M/sec dd bs=10M iflag=direct transfer.

> The other question I have is about the use of the pagemap.h RCU
> protocol for eventual page count stability. With your proposal, this
> would now affect only head pages, so THP splitting is fine :) . I'm
> not sure who else might use that protocol, but it looks like we should
> either make all get_page_unless_zero call sites follow it (if the
> protocol matters to someone) or none (if the protocol turns out to be
> obsolete).

I don't see who is using synchronize_rcu to stabilize the page count,
so at first sight it seems superfluous there too. Maybe it was meant as
an "if anybody ever needs to stabilize the page count, this can be
used". The only calls of synchronize_rcu in mm/* are in memcg and in
the mmu notifier code, which is not meant to synchronize the page count
but just to walk the mmu notifier registration list locklessly from the
mm struct.

I guess we need to ask whoever wrote that function for clarification on
the page count stabilization. And if one really needs to stabilize the
page count, he will also need Paul's rcu_sequence_t feature to do it
really efficiently (which is now on hold, so if that synchronize_rcu
caller really exists, that would likely also mean we need
rcu_sequence_t to optimize it properly). My current feeling is that if
one needs that feature he's doing something wrong that could be achieved
some other way, but I may be biased by the fact that this one worked out.
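
For reference, the reader side of that protocol is roughly the usual
speculative lookup (simplified sketch from memory; the real
find_get_page() goes through radix_tree_lookup_slot() and retries, and
"mapping"/"offset" here are just placeholders):

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, offset);
	if (page && !page_cache_get_speculative(page))
		page = NULL;	/* page was being freed, caller retries */
	rcu_read_unlock();

The "stabilization" side would be a freer calling synchronize_rcu()
after unlinking the page, which is exactly the usage nobody in mm/*
seems to rely on today.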

^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix
  2011-08-24  0:09       ` Andrea Arcangeli
@ 2011-08-24  0:27         ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-24  0:27 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Wed, Aug 24, 2011 at 02:09:14AM +0200, Andrea Arcangeli wrote:
> That's an optimization I can look into agreed. I guess I just added
> one line and not even think too much at optimizing this,
> split_huge_page isn't in a fast path.

So this would more or less be the optimization (untested):

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1169,8 +1169,8 @@ static void __split_huge_page_refcount(s
 		atomic_sub(page_mapcount(page_tail), &page->_count);
 		BUG_ON(atomic_read(&page->_count) <= 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		atomic_add(page_mapcount(page_tail), &page_tail->_count);
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();

This might also be possible, but I'm scared by it because the value
would be set by plain C without a locked op, and I wonder what happens
when get_page_unless_zero runs concurrently. We already rely on atomic
(without lock prefix) writes from C for all pagetable updates, so I
guess this might actually work ok in practice too.
get_page_unless_zero not incrementing anything sounds unlikely, and
it's hard to see how it could increment zero or a random value if the
"long" write is done in a single asm insn (like we rely on in other
places). But still, the above is obviously safe, the below far less
obvious, and generally one is always forced to use locked ops on any
region of memory that is concurrently modified to get a deterministic
result. And there's nothing anywhere doing atomic_set on page->_count
except at boot, where there are no races before the pages are visible
to the buddy allocator. So for now I'll stick to the above version
unless somebody can guarantee the safety of the below (which I can't).
Comments welcome..

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1169,8 +1169,8 @@ static void __split_huge_page_refcount(s
 		atomic_sub(page_mapcount(page_tail), &page->_count);
 		BUG_ON(atomic_read(&page->_count) <= 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		atomic_add(page_mapcount(page_tail), &page_tail->_count);
+		atomic_set(&page_tail->_count,
+			   page_mapcount(page) + page_mapcount(page_tail) + 1);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
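
For reference, the concurrent reader this has to stay deterministic
against is just (quoting the helper from memory, modulo any debug
checks in a given tree):

	static inline int get_page_unless_zero(struct page *page)
	{
		return atomic_inc_not_zero(&page->_count);
	}

so the question is really whether a plain store to page_tail->_count
can be observed half-done by that atomic_inc_not_zero().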

^ permalink raw reply	[flat|nested] 109+ messages in thread


* [PATCH] thp: tail page refcounting fix #2
  2011-08-24  0:27         ` Andrea Arcangeli
@ 2011-08-24 13:34           ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-24 13:34 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Wed, Aug 24, 2011 at 02:27:17AM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 24, 2011 at 02:09:14AM +0200, Andrea Arcangeli wrote:
> > That's an optimization I can look into agreed. I guess I just added
> > one line and not even think too much at optimizing this,
> > split_huge_page isn't in a fast path.
> 
> So this would more or less be the optimization (untested):
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1169,8 +1169,8 @@ static void __split_huge_page_refcount(s
>  		atomic_sub(page_mapcount(page_tail), &page->_count);
>  		BUG_ON(atomic_read(&page->_count) <= 0);
>  		BUG_ON(atomic_read(&page_tail->_count) != 0);
> -		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> -		atomic_add(page_mapcount(page_tail), &page_tail->_count);
> +		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> +			   &page_tail->_count);
>  
>  		/* after clearing PageTail the gup refcount can be released */
>  		smp_mb();

So this is a new version incorporating only the above
microoptimization. Unless somebody can guarantee me the atomic_set is
safe in all archs (which requires get_page_unless_zero() running vs C
language page_tail->_count = 1 to provide a deterministic result) I'd
stick with the atomic_add above to be sure.

I think even on x86 32bit it wouldn't be safe on PPro with OOSTORE
(PPro errata 66, 92) which should also have PSE.

===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc in get_page weren't atomic. By contrast, other
get_user_pages users, like the secondary-MMU page fault handlers that establish
the shadow pagetables, would never call any superfluous get_page after
get_user_pages returns. It's safer to make get_page universally safe for tail
pages and to use
get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
is safe to do the refcounting for tail pages without taking any locks because
it is run within PT lock protected critical sections (PT lock for pte and
page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
direct-io instead will now take the compound_lock but still only for tail
pages. The direct-io paths are usually I/O bound and the compound_lock is per
THP so very fine-grained, so there's no risk of scalability issues with it. A
simple direct-io benchmark with all the lockdep prove-locking and spinlock
debugging infrastructure enabled shows identical performance and no overhead.
So it's worth it. Ideally direct-io should stop calling get_page() on pages
returned by get_user_pages(). The spinlock in get_page() is already optimized
away for no-THP builds but doing get_page() on tail pages returned by GUP is
generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
critical sections will also allow the working set estimation code to work
without any further complexity associated with the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -355,38 +355,80 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, like _count, starts from -1:
+ * so that transitions both from it and to it can be tracked,
+ * using atomic_inc_and_test and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
-static inline void get_page(struct page *page)
+static inline void __get_page_tail_foll(struct page *page)
 {
 	/*
-	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
-	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+extern int __get_page_tail(struct page *page);
+
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
 		/*
 		 * This is safe only because
 		 * __split_huge_page_refcount can't run under
-		 * get_page().
+		 * get_page_foll() because we hold the proper PT lock.
 		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
+		__get_page_tail_foll(page);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
 	}
 }
 
+static inline void get_page(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		if (__get_page_tail(page))
+			return;
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
+	atomic_inc(&page->_count);
+}
+
 static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
@@ -803,21 +845,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1164,11 +1164,13 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		atomic_sub(page_mapcount(page_tail), &page->_count);
+		BUG_ON(atomic_read(&page->_count) <= 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1206,7 +1208,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1514,7 +1514,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,10 @@ static void put_compound_page(struct pag
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +161,32 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *head_page = compound_trans_head(page);
+	if (likely(page != head_page)) {
+		flags = compound_lock_irqsave(head_page);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page);
+			got = 1;
+		}
+		compound_unlock_irqrestore(head_page, flags);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix #2
  2011-08-24 13:34           ` Andrea Arcangeli
@ 2011-08-26  6:24             ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-26  6:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Wed, Aug 24, 2011 at 03:34:59PM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 24, 2011 at 02:27:17AM +0200, Andrea Arcangeli wrote:
> > On Wed, Aug 24, 2011 at 02:09:14AM +0200, Andrea Arcangeli wrote:
> > > That's an optimization I can look into agreed. I guess I just added
> > > one line and not even think too much at optimizing this,
> > > split_huge_page isn't in a fast path.
> > 
> > So this would more or less be the optimization (untested):
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1169,8 +1169,8 @@ static void __split_huge_page_refcount(s
> >  		atomic_sub(page_mapcount(page_tail), &page->_count);
> >  		BUG_ON(atomic_read(&page->_count) <= 0);
> >  		BUG_ON(atomic_read(&page_tail->_count) != 0);
> > -		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> > -		atomic_add(page_mapcount(page_tail), &page_tail->_count);
> > +		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> > +			   &page_tail->_count);
> >  
> >  		/* after clearing PageTail the gup refcount can be released */
> >  		smp_mb();
> 
> So this is a new version incorporating only the above
> microoptimization. Unless somebody can guarantee me the atomic_set is
> safe in all archs (which requires get_page_unless_zero() running vs C
> language page_tail->_count = 1 to provide a deterministic result) I'd
> stick with the atomic_add above to be sure.
> 
> I think even on x86 32bit it wouldn't be safe on PPro with OOSTORE
> (PPro errata 66, 92) which should also have PSE.

I had never heard before of locked instructions being necessary when a
straight assignment would do what we want, but after reading the errata
you listed, I'm not so sure anymore. Given that, I think the version with
just one single atomic add is good enough.

(there are also 511 consecutive atomic_sub calls on the head page _count,
which could just as well be coalesced into a single one at the end of the
tail page loop).
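
Roughly what I have in mind, as an untested sketch on top of your
__split_huge_page_refcount() loop (the local variable name is mine, and
whether the head _count update can really be delayed until after the
loop would need checking against the smp_mb()/PageTail ordering):

	int tail_mapcounts = 0;

	for (i = 1; i < HPAGE_PMD_NR; i++) {
		struct page *page_tail = page + i;

		/* accumulate instead of one atomic_sub on the head per tail */
		tail_mapcounts += page_mapcount(page_tail);
		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
			   &page_tail->_count);
		/* ... rest of the per-tail work unchanged ... */
	}
	/* one locked op on the head page instead of HPAGE_PMD_NR - 1 */
	atomic_sub(tail_mapcounts, &page->_count);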


But enough about the atomics - there are other points I want to give feedback on.


I think your current __get_page_tail() is unsafe when it takes the
compound lock on the head page, because there is no refcount held on it.
If the THP page gets broken up before we get the compound lock, the head
page could get freed. But it looks like you could fix that by doing
get_page_unless_zero on the head, and you should end up with something
very much like the put_page() function, which I find incredibly tricky
but seems to be safe.
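
Something like this, maybe (untested, just to illustrate where the
get_page_unless_zero() would go; the error paths and the symmetry with
put_page() need more thought):

	int __get_page_tail(struct page *page)
	{
		unsigned long flags;
		int got = 0;
		struct page *head_page = compound_trans_head(page);

		if (likely(page != head_page) &&
		    get_page_unless_zero(head_page)) {
			/* the head can't be freed from under us now */
			flags = compound_lock_irqsave(head_page);
			if (likely(PageTail(page))) {
				__get_page_tail_foll(page);
				got = 1;
			}
			compound_unlock_irqrestore(head_page, flags);
			/* drop the temporary pin on the head */
			put_page(head_page);
		}
		return got;
	}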


I would suggest moving get_page_foll() and __get_page_tail_foll() to
mm/internal.h so that people writing code outside of mm/ don't get confused
about which get_page() version they must call.


In __get_page_tail(), you could add a VM_BUG_ON(page_mapcount(page) <= 0)
to reflect the fact that get_page() callers are expected to have already
gotten a reference on the page through a gup call.


(not your fault, you just moved that code) The comment above
reset_page_mapcount() and page_mapcount() mentions that _count starts from -1.
This does not seem to be accurate anymore - as you see page_count() just
returns the _count value without adding 1. I guess you could just remove
', like _count,' from the comment and that'd make it accurate :)


The use of _mapcount to store tail page counts should probably be
documented somewhere - probably in mm_types.h where _mapcount is
defined, and/or before the page_mapcount accessor function. Or, there
could be a tail_page_count() accessor function for that so that it's
evident in all call sites that we're accessing a refcount and not a mapcount:

static inline int tail_page_count(struct page *page)
{
	VM_BUG_ON(!PageTail(page));
	return page_mapcount(page);
}


(probably for another commit) I'm not too comfortable with having several
arch-specific fast gup functions knowing details about how page counts
are implemented. Linus's tree also adds such support in the sparc arch
(and it doesn't even seem to be correct, as it increments the head count
but not the tail count). This should probably be cleaned up sometime by
moving such details into generic inline helper functions.
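
For example, a generic version of the helper could simply reuse the body
that this patch puts in the x86 and powerpc gup code (untested sketch; a
common header location would still need to be picked):

static inline void get_huge_page_tail(struct page *page)
{
	/*
	 * The caller guarantees __split_huge_page_refcount() cannot
	 * run from under us, so the tail reference can be taken on
	 * _mapcount without the compound lock.
	 */
	VM_BUG_ON(page_mapcount(page) < 0);
	VM_BUG_ON(atomic_read(&page->_count) != 0);
	atomic_inc(&page->_mapcount);
}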


Besides these comments, overall I like the change a lot & I'm especially
happy to see get_page() work in all cases again :)

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #2
  2011-08-26  6:24             ` Michel Lespinasse
@ 2011-08-26 16:10               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-26 16:10 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Thu, Aug 25, 2011 at 11:24:36PM -0700, Michel Lespinasse wrote:
> I had never heard before of locked instructions being necessary when a
> straight assignment would do what we want, but after reading the erratas
> you listed, I'm not so sure anymore. Given that, I think the version with
> just one single atomic add is good enough.

spin_unlock sometimes adds the lock prefix too for that reason. So
I feel safer that way.

> (there are also 511 consecutive atomic_sub calls on the head page _count,
> which could just as well be coalesced into a signle one at the end of the
> tail page loop).

That should be safe. It's not like I'm in the mood to microoptimize
__split_huge_page_refcount after you found I forgot the
get_page_unless_zero needed to keep the page->flags stable (they're
overwritten by the time the head page is freed, which is why we need it).

> I think your current __get_page_tail() is unsafe when it takes the
> compound lock on the head page, because there is no refcount held on it.
> If the THP page gets broken up before we get the compound lock, the head
> page could get freed. But it looks like you could fix that by doing
> get_page_unless_zero on the head, and you should end up with something
> very much like the put_page() function, which I find incredibly tricky
> but seems to be safe.

Correct, it's enough and we need it for the same reason it is needed in
put_page. There's nothing new and no new fundamental problem with this
approach, just an implementation mistake. At least it could introduce no
regression compared to the previous code.

> I would suggest moving get_page_foll() and __get_page_tail_foll() to
> mm/internal.h so that people writing code outside of mm/ don't get confused
> about which get_page() version they must call.

Good idea. That is for MM internal usage only; only follow_page is
allowed to call it.

> In __get_page_tail(), you could add a VM_BUG_ON(page_mapcount(page) <= 0)
> to reflect the fact that get_page() callers are expected to have already
> gotten a reference on the page through a gup call.

So I could put it just before calling __get_page_tail_foll().

I don't see a way anybody could call get_page on a tail page without
having called gup on it first. So I think it's correct. Any
pfn-scanning code like your working set estimation code has to use
get_page_unless_zero and that will never succeed anymore for tail
pages.
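
I.e. the usual pattern (hypothetical scanner loop body, just to
illustrate; with this change a THP tail page simply fails the try-get):

	struct page *page = pfn_to_page(pfn);

	if (!get_page_unless_zero(page))
		continue;	/* free page, or a tail page whose _count stays 0 */
	/* ... examine the page ... */
	put_page(page);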

> (not your fault, you just moved that code) The comment above
> reset_page_mapcount() and page_mapcount() mentions that _count starts from -1.
> This does not seem to be accurate anymore - as you see page_count() just
> returns the _count value without adding 1. I guess you could just remove
> ', like _count,' from the comment and that'd make it accurate :)

The comment talks about _mapcount not _count. page_mapcount still adds
1 to _mapcount and _mapcount really still starts from -1.

> The use of _mapcount to store tail page counts should probably be
> documented somewhere - probably in mm_types.h where _mapcount is
> defined, and/or before the page_mapcount accessor function. Or, there
> could be a tail_page_count() accessor function for that so that it's
> evident in all call sites that we're accessing a refcount and not a mapcount:
> 
> static inline int tail_page_count(struct page *page)
> {
> 	VM_BUG_ON(!PageTail(page));
> 	return page_mapcount(page);
> }
> 
> 
> (probably for another commit) I'm not too comfortable with having several
> arch-specific fast gup functions knowning details about how page counts
> are implemented. Linus's tree also adds such support in sparc arch
> (and it doesn't even seem to be correct as it increments the head count
> but not the tail count). This should probably be cleaned up sometime by
> moving such details into generic inline helper functions.
> 
> 
> Besides these comments, overall I like the change a lot & I'm especially
> happy to see get_page() work in all cases again :)

Glad to hear :).

Thanks a lot for pointing out the missing get_page_unless_zero(). I'll
post a #3 version soon with that bit fixed.

I'm undecided whether tail_page_count is needed. The only benefit would be
being able to grep for tail_page_count and see the few call sites; maybe
that makes it worth it. I doubt the VM_BUG_ON is necessary there,
considering it's easy to review the callsites and they're so few. It'd
also need to go into internal.h I guess.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix #3
  2011-08-26 16:10               ` Andrea Arcangeli
@ 2011-08-26 18:54                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-26 18:54 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Fri, Aug 26, 2011 at 06:10:48PM +0200, Andrea Arcangeli wrote:
> Thanks a lot for pointing out the missing get_page_unless_zero(). I'll
> post a #3 version soon with that bit fixed.

So here is an incremental change to review. It survived the initial
O_DIRECT-on-thp/thp-swapping simultaneous stress testing so far, but
this is inconclusive because those races are theoretical and can't be
reproduced anyway, so it'll require more review. The cleanup of
page_tail_count couldn't be done in internal.h; it requires a larger
cleanup that I prefer to do separately if needed, as it'd move code
around, making the changes harder to review.

I also took the opportunity to remove the PageHead check that was only
there for debugging, and to implement the VM_BUG_ON as documented by
split_huge_page_refcount too (the compound_lock could always have run
after page_head stopped being a head page, and that's ok as long as we
have the refcount).

This makes put_compound_page more similar to __get_page_tail.

The put_page in __get_page_tail could be done unconditionally instead
of doing put_page_testzero(page_head) inside the critical section of
__get_page_tail, but this is done there so we can VM_BUG_ON if the
refcount reaches zero, because it must not if the page is a tail
page and we hold the lock.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,9 +356,9 @@ static inline struct page *compound_head
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
  */
 static inline void reset_page_mapcount(struct page *page)
 {
@@ -397,29 +397,10 @@ static inline void __get_page_tail_foll(
 
 extern int __get_page_tail(struct page *page);
 
-static inline void get_page_foll(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page_foll() because we hold the proper PT lock.
-		 */
-		__get_page_tail_foll(page);
-	else {
-		/*
-		 * Getting a normal page or the head of a compound page
-		 * requires to already have an elevated page->_count.
-		 */
-		VM_BUG_ON(atomic_read(&page->_count) <= 0);
-		atomic_inc(&page->_count);
-	}
-}
-
 static inline void get_page(struct page *page)
 {
 	if (unlikely(PageTail(page)))
-		if (__get_page_tail(page))
+		if (likely(__get_page_tail(page)))
 			return;
 	/*
 	 * Getting a normal page or the head of a compound page
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -39,6 +39,16 @@ struct page {
 		atomic_t _mapcount;	/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
 					 */
 		struct {		/* SLUB */
 			u16 inuse;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1166,8 +1167,9 @@ static void __split_huge_page_refcount(s
 
 		/* tail_page->_mapcount cannot change */
 		BUG_ON(page_mapcount(page_tail) < 0);
-		atomic_sub(page_mapcount(page_tail), &page->_count);
-		BUG_ON(atomic_read(&page->_count) <= 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
 		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
 			   &page_tail->_count);
@@ -1188,10 +1190,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1224,6 +1223,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,25 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,9 +103,9 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
@@ -173,15 +155,37 @@ int __get_page_tail(struct page *page)
 	 */
 	unsigned long flags;
 	int got = 0;
-	struct page *head_page = compound_trans_head(page);
-	if (likely(page != head_page)) {
-		flags = compound_lock_irqsave(head_page);
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
 		/* here __split_huge_page_refcount won't run anymore */
 		if (likely(PageTail(page))) {
+			/*
+			 * get_page() can only be called on tail pages
+			 * after get_page_foll() taken a tail page
+			 * refcount.
+			 */
+			VM_BUG_ON(page_mapcount(page) <= 0);
 			__get_page_tail_foll(page);
 			got = 1;
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
 		}
-		compound_unlock_irqrestore(head_page, flags);
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
 	}
 	return got;
 }


Full patch:

===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel, while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that use get_page_unless_zero() in SMP, if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc in get_page weren't atomic. In contrast, other
get_user_page users, like the secondary-MMU page fault that establishes the
shadow pagetables, would never call any superfluous get_page after
get_user_page returns. It's safer to make get_page universally safe for tail
pages and to use get_page_foll() within follow_page (inside get_user_pages()).
get_page_foll() is safe to do the refcounting for tail pages without taking any
locks because it is run within PT lock protected critical sections (PT lock for
pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked
by direct-io instead will now take the compound_lock, but still only for tail
pages. The direct-io paths are usually I/O bound, and the compound_lock is per
THP and therefore very fine-grained, so there's no risk of scalability issues
with it. A simple direct-io benchmark with all the lockdep prove-locking and
spinlock debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling get_page() on
pages returned by get_user_pages(). The spinlock in get_page() is already
optimized away for no-THP builds, but doing get_page() on tail pages returned
by GUP is generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount, in addition to avoiding new RCU
critical sections, will also allow the working set estimation code to work
without any further complexity associated with the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -355,36 +355,59 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+extern int __get_page_tail(struct page *page);
+
 static inline void get_page(struct page *page)
 {
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
 	/*
 	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
 	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page().
-		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
-	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -803,21 +826,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -39,6 +39,16 @@ struct page {
 		atomic_t _mapcount;	/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
 					 */
 		struct {		/* SLUB */
 			u16 inuse;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1164,11 +1165,14 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1186,10 +1190,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1206,7 +1207,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
@@ -1223,6 +1223,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,25 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1514,7 +1514,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,16 +103,17 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +143,54 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			/*
+			 * get_page() can only be called on tail pages
+			 * after get_page_foll() taken a tail page
+			 * refcount.
+			 */
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			__get_page_tail_foll(page);
+			got = 1;
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix #3
@ 2011-08-26 18:54                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-26 18:54 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Fri, Aug 26, 2011 at 06:10:48PM +0200, Andrea Arcangeli wrote:
> Thanks a lot for pointing out the missing get_page_unless_zero(). I'll
> post a #3 version soon with that bit fixed.

So here is an incremental change to review. It survived the initial
O_DIRECT-on-thp/thp-swapping simultaneous stress testing so far, but
this is inconclusive because those races are theoretical and can't be
reproduced anyway, so it'll require more review. The cleanup of
page_tail_count couldn't be done in internal.h; it requires a larger
cleanup that I prefer to do separately if needed, as it'd move code
around, making the changes harder to review.

I also took the opportunity to remove the PageHead check that was only
there for debugging, and to implement the VM_BUG_ON as documented by
split_huge_page_refcount too (the compound_lock could always have run
after page_head stopped being a head page, and that's ok as long as we
have the refcount).

This makes put_compound_page more similar to __get_page_tail.

The put_page in __get_page_tail could be done unconditionally instead
of doing put_page_testzero(page_head) inside the critical section of
__get_page_tail, but this is done there so we can VM_BUG_ON if the
refcount reaches zero, because it must not if the page is a tail
page and we hold the lock.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,9 +356,9 @@ static inline struct page *compound_head
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
  */
 static inline void reset_page_mapcount(struct page *page)
 {
@@ -397,29 +397,10 @@ static inline void __get_page_tail_foll(
 
 extern int __get_page_tail(struct page *page);
 
-static inline void get_page_foll(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page_foll() because we hold the proper PT lock.
-		 */
-		__get_page_tail_foll(page);
-	else {
-		/*
-		 * Getting a normal page or the head of a compound page
-		 * requires to already have an elevated page->_count.
-		 */
-		VM_BUG_ON(atomic_read(&page->_count) <= 0);
-		atomic_inc(&page->_count);
-	}
-}
-
 static inline void get_page(struct page *page)
 {
 	if (unlikely(PageTail(page)))
-		if (__get_page_tail(page))
+		if (likely(__get_page_tail(page)))
 			return;
 	/*
 	 * Getting a normal page or the head of a compound page
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -39,6 +39,16 @@ struct page {
 		atomic_t _mapcount;	/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
 					 */
 		struct {		/* SLUB */
 			u16 inuse;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1166,8 +1167,9 @@ static void __split_huge_page_refcount(s
 
 		/* tail_page->_mapcount cannot change */
 		BUG_ON(page_mapcount(page_tail) < 0);
-		atomic_sub(page_mapcount(page_tail), &page->_count);
-		BUG_ON(atomic_read(&page->_count) <= 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
 		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
 			   &page_tail->_count);
@@ -1188,10 +1190,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1224,6 +1223,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,25 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,9 +103,9 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
@@ -173,15 +155,37 @@ int __get_page_tail(struct page *page)
 	 */
 	unsigned long flags;
 	int got = 0;
-	struct page *head_page = compound_trans_head(page);
-	if (likely(page != head_page)) {
-		flags = compound_lock_irqsave(head_page);
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
 		/* here __split_huge_page_refcount won't run anymore */
 		if (likely(PageTail(page))) {
+			/*
+			 * get_page() can only be called on tail pages
+			 * after get_page_foll() taken a tail page
+			 * refcount.
+			 */
+			VM_BUG_ON(page_mapcount(page) <= 0);
 			__get_page_tail_foll(page);
 			got = 1;
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
 		}
-		compound_unlock_irqrestore(head_page, flags);
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
 	}
 	return got;
 }


Full patch:

===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel, while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that use get_page_unless_zero() in SMP, if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc in get_page weren't atomic. In contrast, other
get_user_page users, like the secondary-MMU page fault that establishes the
shadow pagetables, would never call any superfluous get_page after
get_user_page returns. It's safer to make get_page universally safe for tail
pages and to use get_page_foll() within follow_page (inside get_user_pages()).
get_page_foll() is safe to do the refcounting for tail pages without taking any
locks because it is run within PT lock protected critical sections (PT lock for
pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked
by direct-io instead will now take the compound_lock, but still only for tail
pages. The direct-io paths are usually I/O bound, and the compound_lock is per
THP and therefore very fine-grained, so there's no risk of scalability issues
with it. A simple direct-io benchmark with all the lockdep prove-locking and
spinlock debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling get_page() on
pages returned by get_user_pages(). The spinlock in get_page() is already
optimized away for no-THP builds, but doing get_page() on tail pages returned
by GUP is generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount, in addition to avoiding new RCU
critical sections, will also allow the working set estimation code to work
without any further complexity associated with the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -355,36 +355,59 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+extern int __get_page_tail(struct page *page);
+
 static inline void get_page(struct page *page)
 {
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
 	/*
 	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
 	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page().
-		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
-	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -803,21 +826,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -39,6 +39,16 @@ struct page {
 		atomic_t _mapcount;	/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
 					 */
 		struct {		/* SLUB */
 			u16 inuse;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1164,11 +1165,14 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1186,10 +1190,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1206,7 +1207,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
@@ -1223,6 +1223,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,25 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1514,7 +1514,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,16 +103,17 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +143,54 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			/*
+			 * get_page() can only be called on tail pages
+			 * after get_page_foll() taken a tail page
+			 * refcount.
+			 */
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			__get_page_tail_foll(page);
+			got = 1;
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #2
  2011-08-26  6:24             ` Michel Lespinasse
@ 2011-08-26 19:28               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-26 19:28 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Thu, Aug 25, 2011 at 11:24:36PM -0700, Michel Lespinasse wrote:
> In __get_page_tail(), you could add a VM_BUG_ON(page_mapcount(page) <= 0)
> to reflect the fact that get_page() callers are expected to have already
> gotten a reference on the page through a gup call.

Turns out this is going to generate false positives. For THP it should
always have been ok, but if you allocate a compound page (one that can't
be split) and then map it on 4k pagetables, doing get_page/put_page in
the map/unmap of the pte, it'll fail because the page fault will be the
first occurrence where the tail page refcount is elevated. I'll check it
in more detail tomorrow... So you may want to delete the bugcheck above
before testing #3.
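
To make the failure mode concrete, here is a minimal userspace sketch of
the two counters involved (the struct and helper below are made up for
illustration only, this is not kernel code): the driver's fault handler
is the first taker of a reference on the tail page, so a bugcheck on
page_mapcount(page) <= 0 in the get_page() path would trip even though
nothing is wrong.

#include <assert.h>
#include <stdio.h>

/* toy model of the relevant struct page fields */
struct fake_page {
	int _count;	/* pinned at 0 for tail pages in the new scheme */
	int _mapcount;	/* starts at -1; page_mapcount() == _mapcount + 1 */
};

static int fake_page_mapcount(struct fake_page *p)
{
	return p->_mapcount + 1;
}

int main(void)
{
	struct fake_page tail = { ._count = 0, ._mapcount = -1 };

	/*
	 * First get_page() on the tail page, issued from the driver's
	 * .fault handler: the mapcount-based tail refcount is still 0
	 * here, so the proposed VM_BUG_ON would be a false positive.
	 */
	assert(fake_page_mapcount(&tail) == 0);

	tail._mapcount++;	/* what the tail-page get_page() path does */
	printf("tail refcount after first get_page(): %d\n",
	       fake_page_mapcount(&tail));
	return 0;
}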

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #3
  2011-08-26 18:54                 ` Andrea Arcangeli
@ 2011-08-27  9:41                   ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-27  9:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Fri, Aug 26, 2011 at 08:54:36PM +0200, Andrea Arcangeli wrote:
> Subject: thp: tail page refcounting fix
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Michel while working on the working set estimation code, noticed that calling
> get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
> pfn ended up being a tail page of a transparent hugepage under splitting by
> __split_huge_page_refcount(). He then found the problem could also
> theoretically materialize with page_cache_get_speculative() during the
> speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
> radix tree page is freed and reallocated and get_user_pages is called on it
> before page_cache_get_speculative has a chance to call get_page_unless_zero().
> 
> So the best way to fix the problem is to keep page_tail->_count zero at all
> times. This will guarantee that get_page_unless_zero() can never succeed on any
> tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
> pages of a compound page, so we can simply account the tail page references
> there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
> addition to the head_page->_mapcount).
> 
> While debugging this s/_count/_mapcount/ change I also noticed get_page is
> called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
> safe because the two atomic_inc in get_page weren't atomic. As opposed other
> get_user_page users like secondary-MMU page fault to establish the shadow
> pagetables would never call any superflous get_page after get_user_page
> returns. It's safer to make get_page universally safe for tail pages and to use
> get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
> is safe to do the refcounting for tail pages without taking any locks because
> it is run within PT lock protected critical sections (PT lock for pte and
> page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
> direct-io instead will now take the compound_lock but still only for tail
> pages. The direct-io paths are usually I/O bound and the compound_lock is per
> THP so very finegrined, so there's no risk of scalability issues with it. A
> simple direct-io benchmarks with all lockdep prove locking and spinlock
> debugging infrastructure enabled shows identical performance and no overhead.
> So it's worth it. Ideally direct-io should stop calling get_page() on pages
> returned by get_user_pages(). The spinlock in get_page() is already optimized
> away for no-THP builds but doing get_page() on tail pages returned by GUP is
> generally a rare operation and usually only run in I/O paths.
> 
> This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
> critical sections will also allow the working set estimation code to work
> without any further complexity associated to the tail page refcounting
> with THP.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Michel Lespinasse <walken@google.com>

Looks great!

I understand you may have to remove the VM_BUG_ON(page_mapcount(page) <= 0)
that I had suggested in __get_page_tail() (sorry about that).

My only additional suggestion is about the put_page_testzero in
__get_page_tail(): if you could just increment the tail page count
instead of calling __get_page_tail_foll(), you wouldn't have to
release the extra head page count there. It would even look kind of
natural: the head page count gets acquired before compound_lock_irqsave(),
so we only have to acquire an extra tail page count after confirming
this is still a tail page.
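
A rough sketch of the suggested flow, reusing the helpers already
introduced in this thread (illustration only, not a drop-in patch):

int __get_page_tail(struct page *page)
{
	unsigned long flags;
	int got = 0;
	struct page *page_head = compound_trans_head(page);

	if (likely(page != page_head && get_page_unless_zero(page_head))) {
		flags = compound_lock_irqsave(page_head);
		if (likely(PageTail(page))) {
			/*
			 * Take only the extra tail reference here; the head
			 * reference acquired above becomes the one get_page()
			 * returns with, so no put_page_testzero() is needed
			 * in the success path.
			 */
			atomic_inc(&page->_mapcount);
			got = 1;
		}
		compound_unlock_irqrestore(page_head, flags);
		if (unlikely(!got))
			put_page(page_head);	/* drop the speculative head ref */
	}
	return got;
}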

Either way, the code looks OK now.

Reviewed-by: Michel Lespinasse <walken@google.com>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix #4
  2011-08-27  9:41                   ` Michel Lespinasse
@ 2011-08-27 17:34                     ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-27 17:34 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Sat, Aug 27, 2011 at 02:41:52AM -0700, Michel Lespinasse wrote:
> I understand you may have to remove the VM_BUG_ON(page_mapcount(page) <= 0)
> that I had suggested in __get_page_tail() (sorry about that).

The function doing that is snd_pcm_mmap_data_fault and it's doing what
I described in the previous email.

> My only additional suggestion is about the put_page_testzero in
> __get_page_tail(), maybe if you could just increment the tail page count
> instead of calling __get_page_tail_foll(), then you wouldn't have to
> release the extra head page count there. And it would even look kinda
> natural, head page count gets acquired before compound_lock_irqsave(),
> so we only have to acquire an extra tail page count after confirming
> this is still a tail page.

Ok, I added a param to __get_page_tail_foll(); it is constant at build
time, and because the function is inline the branch should be optimized
away at build time without requiring a separate function. The bugchecks
are the same, so we can share them and just skip the atomic_inc on the
page_head in __get_page_tail_foll(). It also had to be moved into
internal.h as a further cleanup.

> Either way, the code looks OK by now.
> 
> Reviewed-by: Michel Lespinasse <walken@google.com>

Thanks. Here is the incremental diff to correct the false-positive bug
for drivers like ALSA that allocate __GFP_COMP compound pages and map
subpages with page faults.

diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -166,12 +166,6 @@ int __get_page_tail(struct page *page)
 		flags = compound_lock_irqsave(page_head);
 		/* here __split_huge_page_refcount won't run anymore */
 		if (likely(PageTail(page))) {
-			/*
-			 * get_page() can only be called on tail pages
-			 * after get_page_foll() taken a tail page
-			 * refcount.
-			 */
-			VM_BUG_ON(page_mapcount(page) <= 0);
 			__get_page_tail_foll(page);
 			got = 1;

 			/*

This is the optimization.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -375,26 +375,6 @@ static inline int page_count(struct page
 	return atomic_read(&compound_head(page)->_count);
 }
 
-static inline void __get_page_tail_foll(struct page *page)
-{
-	/*
-	 * If we're getting a tail page, the elevated page->_count is
-	 * required only in the head page and we will elevate the head
-	 * page->_count and tail page->_mapcount.
-	 *
-	 * We elevate page_tail->_mapcount for tail pages to force
-	 * page_tail->_count to be zero at all times to avoid getting
-	 * false positives from get_page_unless_zero() with
-	 * speculative page access (like in
-	 * page_cache_get_speculative()) on tail pages.
-	 */
-	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	VM_BUG_ON(page_mapcount(page) < 0);
-	atomic_inc(&page->first_page->_count);
-	atomic_inc(&page->_mapcount);
-}
-
 extern int __get_page_tail(struct page *page);
 
 static inline void get_page(struct page *page)
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,28 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page,
+					bool get_page_head)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	if (get_page_head)
+		atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
 static inline void get_page_foll(struct page *page)
 {
 	if (unlikely(PageTail(page)))
@@ -45,7 +67,7 @@ static inline void get_page_foll(struct 
 		 * __split_huge_page_refcount() can't run under
 		 * get_page_foll() because we hold the proper PT lock.
 		 */
-		__get_page_tail_foll(page);
+		__get_page_tail_foll(page, true);
 	else {
 		/*
 		 * Getting a normal page or the head of a compound page
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -166,16 +166,8 @@ int __get_page_tail(struct page *page)
 		flags = compound_lock_irqsave(page_head);
 		/* here __split_huge_page_refcount won't run anymore */
 		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page);
+			__get_page_tail_foll(page, false);
 			got = 1;
-			/*
-			 * We can release the refcount taken by
-			 * get_page_unless_zero() now that
-			 * __split_huge_page_refcount() is blocked on
-			 * the compound_lock.
-			 */
-			if (put_page_testzero(page_head))
-				VM_BUG_ON(1);
 		}
 		compound_unlock_irqrestore(page_head, flags);
 		if (unlikely(!got))


===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel, while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found that the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that use get_page_unless_zero() on SMP, if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).
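
For reference, get_page_unless_zero() boils down to an
increment-unless-zero on page->_count, so pinning tail_page->_count at
zero makes it fail unconditionally on tail pages. A standalone userspace
sketch of that invariant (made-up names, C11 atomics instead of the
kernel primitives):

#include <stdatomic.h>
#include <stdio.h>

/* rough model of the increment-unless-zero used by speculative lookups */
static int inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return 1;	/* got a reference */
	}
	return 0;			/* refcount was zero: refuse */
}

int main(void)
{
	atomic_int head_count = 1;	/* head page keeps the real _count */
	atomic_int tail_count = 0;	/* tail page _count pinned at zero */

	printf("speculative get on head: %d\n", inc_not_zero(&head_count));
	printf("speculative get on tail: %d\n", inc_not_zero(&tail_count));
	return 0;
}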

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc in get_page weren't atomic. Other get_user_pages
users, like the secondary-MMU page fault that establishes the shadow
pagetables, would never call any superfluous get_page after get_user_pages
returns. It's safer to make get_page universally safe for tail pages and to use
get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
is safe to do the refcounting for tail pages without taking any locks because
it is run within PT lock protected critical sections (PT lock for pte and
page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
direct-io instead will now take the compound_lock, but still only for tail
pages. The direct-io paths are usually I/O bound and the compound_lock is per
THP so very fine-grained, so there's no risk of scalability issues with it. A
simple direct-io benchmark with all the lockdep prove-locking and spinlock
debugging infrastructure enabled shows identical performance and no overhead.
So it's worth it. Ideally direct-io should stop calling get_page() on pages
returned by get_user_pages(). The spinlock in get_page() is already optimized
away for no-THP builds, but doing get_page() on tail pages returned by GUP is
generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
critical sections will also allow the working set estimation code to work
without any further complexity associated to the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,36 +356,39 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
+extern int __get_page_tail(struct page *page);
+
 static inline void get_page(struct page *page)
 {
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
 	/*
 	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
 	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page().
-		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
-	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -804,21 +807,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -62,10 +62,23 @@ struct page {
 			struct {
 
 				union {
-					atomic_t _mapcount;	/* Count of ptes mapped in mms,
-							 * to show when page is mapped
-							 * & limit reverse map searches.
-							 */
+					/*
+					 * Count of ptes mapped in
+					 * mms, to show when page is
+					 * mapped & limit reverse map
+					 * searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
+					 */
+					atomic_t _mapcount;
 
 					struct {
 						unsigned inuse:16;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1164,11 +1165,14 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1186,10 +1190,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1206,7 +1207,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
@@ -1223,6 +1223,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,47 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page,
+					bool get_page_head)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	if (get_page_head)
+		atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page, true);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1503,7 +1503,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,16 +103,17 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +143,40 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread

+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +143,40 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #4
  2011-08-27 17:34                     ` Andrea Arcangeli
@ 2011-08-29  4:20                       ` Minchan Kim
  -1 siblings, 0 replies; 109+ messages in thread
From: Minchan Kim @ 2011-08-29  4:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michel Lespinasse, Andrew Morton, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Sun, Aug 28, 2011 at 2:34 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Sat, Aug 27, 2011 at 02:41:52AM -0700, Michel Lespinasse wrote:
>> I understand you may have to remove the VM_BUG_ON(page_mapcount(page) <= 0)
>> that I had suggested in __get_page_tail() (sorry about that).
>
> The function doing that is snd_pcm_mmap_data_fault and it's doing what
> I described in prev email.
>
>> My only additional suggestion is about the put_page_testzero in
>> __get_page_tail(), maybe if you could just increment the tail page count
>> instead of calling __get_page_tail_foll(), then you wouldn't have to
>> release the extra head page count there. And it would even look kinda
>> natural, head page count gets acquired before compound_lock_irqsave(),
>> so we only have to acquire an extra tail page count after confirming
>> this is still a tail page.
>
> Ok, I added a param to __get_page_tail_foll, it is constant at build
> time and because it's inline the branch should be optimized away at
> build time without requiring a separate function. The bugchecks are
> the same so we can share and just skip the atomic_inc on the
> page_head in __get_page_tail_foll. Also it had to be moved into
> internal.h as a further cleanup.
>
>> Either way, the code looks OK by now.
>>
>> Reviewed-by: Michel Lespinasse <walken@google.com>
>
> Thanks. Incremental diff to correct the false positive bug on for
> drivers like alsa allocating __GFP_COMP and mapping subpages with page
> faults.
>
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -166,12 +166,6 @@ int __get_page_tail(struct page *page)
>                flags = compound_lock_irqsave(page_head);
>                /* here __split_huge_page_refcount won't run anymore */
>                if (likely(PageTail(page))) {
> -                       /*
> -                        * get_page() can only be called on tail pages
> -                        * after get_page_foll() taken a tail page
> -                        * refcount.
> -                        */
> -                       VM_BUG_ON(page_mapcount(page) <= 0);
>                        __get_page_tail_foll(page);
>                        got = 1;
>
>                        /*
>
> This is the optimization.
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -375,26 +375,6 @@ static inline int page_count(struct page
>        return atomic_read(&compound_head(page)->_count);
>  }
>
> -static inline void __get_page_tail_foll(struct page *page)
> -{
> -       /*
> -        * If we're getting a tail page, the elevated page->_count is
> -        * required only in the head page and we will elevate the head
> -        * page->_count and tail page->_mapcount.
> -        *
> -        * We elevate page_tail->_mapcount for tail pages to force
> -        * page_tail->_count to be zero at all times to avoid getting
> -        * false positives from get_page_unless_zero() with
> -        * speculative page access (like in
> -        * page_cache_get_speculative()) on tail pages.
> -        */
> -       VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> -       VM_BUG_ON(atomic_read(&page->_count) != 0);
> -       VM_BUG_ON(page_mapcount(page) < 0);
> -       atomic_inc(&page->first_page->_count);
> -       atomic_inc(&page->_mapcount);
> -}
> -
>  extern int __get_page_tail(struct page *page);
>
>  static inline void get_page(struct page *page)
> diff --git a/mm/internal.h b/mm/internal.h
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -37,6 +37,28 @@ static inline void __put_page(struct pag
>        atomic_dec(&page->_count);
>  }
>
> +static inline void __get_page_tail_foll(struct page *page,
> +                                       bool get_page_head)
> +{
> +       /*
> +        * If we're getting a tail page, the elevated page->_count is
> +        * required only in the head page and we will elevate the head
> +        * page->_count and tail page->_mapcount.
> +        *
> +        * We elevate page_tail->_mapcount for tail pages to force
> +        * page_tail->_count to be zero at all times to avoid getting
> +        * false positives from get_page_unless_zero() with
> +        * speculative page access (like in
> +        * page_cache_get_speculative()) on tail pages.
> +        */
> +       VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> +       VM_BUG_ON(atomic_read(&page->_count) != 0);
> +       VM_BUG_ON(page_mapcount(page) < 0);
> +       if (get_page_head)
> +               atomic_inc(&page->first_page->_count);
> +       atomic_inc(&page->_mapcount);
> +}
> +
>  static inline void get_page_foll(struct page *page)
>  {
>        if (unlikely(PageTail(page)))
> @@ -45,7 +67,7 @@ static inline void get_page_foll(struct
>                 * __split_huge_page_refcount() can't run under
>                 * get_page_foll() because we hold the proper PT lock.
>                 */
> -               __get_page_tail_foll(page);
> +               __get_page_tail_foll(page, true);
>        else {
>                /*
>                 * Getting a normal page or the head of a compound page
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -166,16 +166,8 @@ int __get_page_tail(struct page *page)
>                flags = compound_lock_irqsave(page_head);
>                /* here __split_huge_page_refcount won't run anymore */
>                if (likely(PageTail(page))) {
> -                       __get_page_tail_foll(page);
> +                       __get_page_tail_foll(page, false);
>                        got = 1;
> -                       /*
> -                        * We can release the refcount taken by
> -                        * get_page_unless_zero() now that
> -                        * __split_huge_page_refcount() is blocked on
> -                        * the compound_lock.
> -                        */
> -                       if (put_page_testzero(page_head))
> -                               VM_BUG_ON(1);
>                }
>                compound_unlock_irqrestore(page_head, flags);
>                if (unlikely(!got))
>
>
> ===
> Subject: thp: tail page refcounting fix
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Michel while working on the working set estimation code, noticed that calling
> get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
> pfn ended up being a tail page of a transparent hugepage under splitting by
> __split_huge_page_refcount(). He then found the problem could also
> theoretically materialize with page_cache_get_speculative() during the
> speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
> radix tree page is freed and reallocated and get_user_pages is called on it
> before page_cache_get_speculative has a chance to call get_page_unless_zero().
>
> So the best way to fix the problem is to keep page_tail->_count zero at all
> times. This will guarantee that get_page_unless_zero() can never succeed on any
> tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
> pages of a compound page, so we can simply account the tail page references
> there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
> addition to the head_page->_mapcount).
>
> While debugging this s/_count/_mapcount/ change I also noticed get_page is
> called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
> safe because the two atomic_inc in get_page weren't atomic. As opposed other
> get_user_page users like secondary-MMU page fault to establish the shadow
> pagetables would never call any superflous get_page after get_user_page
> returns. It's safer to make get_page universally safe for tail pages and to use
> get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
> is safe to do the refcounting for tail pages without taking any locks because
> it is run within PT lock protected critical sections (PT lock for pte and
> page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
> direct-io instead will now take the compound_lock but still only for tail
> pages. The direct-io paths are usually I/O bound and the compound_lock is per
> THP so very finegrined, so there's no risk of scalability issues with it. A
> simple direct-io benchmarks with all lockdep prove locking and spinlock
> debugging infrastructure enabled shows identical performance and no overhead.
> So it's worth it. Ideally direct-io should stop calling get_page() on pages
> returned by get_user_pages(). The spinlock in get_page() is already optimized
> away for no-THP builds but doing get_page() on tail pages returned by GUP is
> generally a rare operation and usually only run in I/O paths.
>
> This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
> critical sections will also allow the working set estimation code to work
> without any further complexity associated to the tail page refcounting
> with THP.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Michel Lespinasse <walken@google.com>
> Reviewed-by: Michel Lespinasse <walken@google.com>

There is just a nitpick below, but the code looks clearer than the
old one and even fixes a bug that I missed but Michel found.

Thanks!

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

> @@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
>        unsigned long head_index = page->index;
>        struct zone *zone = page_zone(page);
>        int zonestat;
> +       int tail_count = 0;
>
>        /* prevent PageLRU to go away from under us, and freeze lru stats */
>        spin_lock_irq(&zone->lru_lock);
> @@ -1164,11 +1165,14 @@ static void __split_huge_page_refcount(s
>        for (i = 1; i < HPAGE_PMD_NR; i++) {
>                struct page *page_tail = page + i;
>
> -               /* tail_page->_count cannot change */
> -               atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> -               BUG_ON(page_count(page) <= 0);
> -               atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> -               BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +               /* tail_page->_mapcount cannot change */
> +               BUG_ON(page_mapcount(page_tail) < 0);
> +               tail_count += page_mapcount(page_tail);
> +               /* check for overflow */
> +               BUG_ON(tail_count < 0);
> +               BUG_ON(atomic_read(&page_tail->_count) != 0);
> +               atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> +                          &page_tail->_count);

I suspect someone might try to change this to atomic_set() in the future
to reduce the LOCK_PREFIX overhead, even though it's not a fast path. Of
course, we could reject such a patch, but we can't prevent the wasted
time. I hope there will be a comment explaining why we use atomic_add(),
like the PPro errata comment for OOStore.
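
For reference, here is a simplified sketch of what get_page_unless_zero()
does: it boils down to atomic_inc_not_zero(&page->_count), which on most
architectures is a cmpxchg loop. The helper name below is made up for
illustration and this is not the actual kernel implementation:

	static inline int sketch_get_page_unless_zero(struct page *page)
	{
		int old;

		do {
			old = atomic_read(&page->_count);
			if (old == 0)
				return 0;	/* page is being freed, don't take it */
		} while (atomic_cmpxchg(&page->_count, old, old + 1) != old);
		return 1;
	}

Since such a locked cmpxchg may be executing on the same word concurrently,
a locked atomic_add() is the conservative way to publish the new tail
count; a plain-store atomic_set() would rely on every architecture
tolerating a non-locked store racing with locked ops.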





-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #4
  2011-08-27 17:34                     ` Andrea Arcangeli
@ 2011-08-29 22:40                       ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-08-29 22:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Sat, Aug 27, 2011 at 10:34 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Subject: thp: tail page refcounting fix
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Michel while working on the working set estimation code, noticed that calling
> get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
> pfn ended up being a tail page of a transparent hugepage under splitting by
> __split_huge_page_refcount(). He then found the problem could also
> theoretically materialize with page_cache_get_speculative() during the
> speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
> radix tree page is freed and reallocated and get_user_pages is called on it
> before page_cache_get_speculative has a chance to call get_page_unless_zero().
>
> So the best way to fix the problem is to keep page_tail->_count zero at all
> times. This will guarantee that get_page_unless_zero() can never succeed on any
> tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
> pages of a compound page, so we can simply account the tail page references
> there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
> addition to the head_page->_mapcount).
>
> While debugging this s/_count/_mapcount/ change I also noticed get_page is
> called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
> safe because the two atomic_inc in get_page weren't atomic. As opposed other
> get_user_page users like secondary-MMU page fault to establish the shadow
> pagetables would never call any superflous get_page after get_user_page
> returns. It's safer to make get_page universally safe for tail pages and to use
> get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
> is safe to do the refcounting for tail pages without taking any locks because
> it is run within PT lock protected critical sections (PT lock for pte and
> page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
> direct-io instead will now take the compound_lock but still only for tail
> pages. The direct-io paths are usually I/O bound and the compound_lock is per
> THP so very finegrined, so there's no risk of scalability issues with it. A
> simple direct-io benchmarks with all lockdep prove locking and spinlock
> debugging infrastructure enabled shows identical performance and no overhead.
> So it's worth it. Ideally direct-io should stop calling get_page() on pages
> returned by get_user_pages(). The spinlock in get_page() is already optimized
> away for no-THP builds but doing get_page() on tail pages returned by GUP is
> generally a rare operation and usually only run in I/O paths.
>
> This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
> critical sections will also allow the working set estimation code to work
> without any further complexity associated to the tail page refcounting
> with THP.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Michel Lespinasse <walken@google.com>
> Reviewed-by: Michel Lespinasse <walken@google.com>

Looks great.

I think some page_mapcount() call sites would be easier to read if you
took up my tail_page_count() suggestion (so we can see at a glance that it's a
refcount rather than a mapcount). But you don't have to do it if you
don't think it helps. I'm happy enough with the code already :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #4
  2011-08-29 22:40                       ` Michel Lespinasse
@ 2011-08-29 23:30                         ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-08-29 23:30 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Mon, Aug 29, 2011 at 03:40:07PM -0700, Michel Lespinasse wrote:
> Looks great.

Thanks!

> I think some page_mapcount call sites would be easier to read if you
> took on my tail_page_count() suggestion (so we can casually see it's a
> refcount rather than mapcount). But you don't have to do it if you
> don't think it helps. I'm happy enough with the code already :)

I initially tried to do it, but I wanted it in internal.h (it's really
an internal thing, not meant for any driver whatsoever), and then the
gup.c files wouldn't see it, so I wasn't sure how to proceed and
dropped it. It's still possible to do it later...
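
For illustration only, the helper Michel suggested could be as small as
this (hypothetical sketch, not part of any posted patch; the open
question is only where to declare it so the arch gup.c files can see it):

	/*
	 * Tail pages park their references in ->_mapcount, so give the
	 * read a name that makes the intent obvious (hypothetical).
	 */
	static inline int tail_page_count(struct page *page)
	{
		VM_BUG_ON(!PageTail(page));
		return page_mapcount(page);
	}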

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix #5
  2011-08-29  4:20                       ` Minchan Kim
@ 2011-09-01 15:24                         ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-09-01 15:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michel Lespinasse, Andrew Morton, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Mon, Aug 29, 2011 at 01:20:26PM +0900, Minchan Kim wrote:
> I doubt someone might try to change this with atomic_set for reducing
> LOCK_PREFIX overhead in future although it's not fast path. Of course,
> we can prevent that patch but can't prevent his wasted time. I hope
> there is a comment why we use atomic_add like the errata PPro with
> OOStore.

Sure, good idea. I was waiting for Michel's ack before sending a further
update to add the comment. Because the code seems to have settled down
(it's running stable without the slightest problem on all my systems),
I think I can send a final version with the comment too.

This is the diff between #4 and #5.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1171,6 +1171,19 @@ static void __split_huge_page_refcount(s
 		/* check for overflow */
 		BUG_ON(tail_count < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		/*
+		 * tail_page->_count is zero and not changing from
+		 * under us. But get_page_unless_zero() may be running
+		 * from under us on the tail_page. If we used
+		 * atomic_set() below instead of atomic_add(), we
+		 * would then run atomic_set() concurrently with
+		 * get_page_unless_zero(), and atomic_set() is
+		 * implemented in C not using locked ops. spin_unlock
+		 * on x86 sometime uses locked ops because of PPro
+		 * errata 66, 92, so unless somebody can guarantee
+		 * atomic_set() here would be safe on all archs (and
+		 * not only on x86), it's safer to use atomic_add().
+		 */
 		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
 			   &page_tail->_count);
 
===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc in get_page weren't atomic. By contrast, other
get_user_pages users, like the secondary-MMU page fault handlers that establish
the shadow pagetables, never call any superfluous get_page after get_user_pages
returns. It's safer to make get_page universally safe for tail pages and to use
get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
is safe to do the refcounting for tail pages without taking any locks because
it is run within PT lock protected critical sections (PT lock for pte and
page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
direct-io instead will now take the compound_lock but still only for tail
pages. The direct-io paths are usually I/O bound and the compound_lock is per
THP so very finegrined, so there's no risk of scalability issues with it. A
simple direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no overhead.
So it's worth it. Ideally direct-io should stop calling get_page() on pages
returned by get_user_pages(). The spinlock in get_page() is already optimized
away for no-THP builds but doing get_page() on tail pages returned by GUP is
generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
critical sections will also allow the working set estimation code to work
without any further complexity associated with the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,36 +356,39 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
+extern int __get_page_tail(struct page *page);
+
 static inline void get_page(struct page *page)
 {
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
 	/*
 	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
 	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page().
-		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
-	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -804,21 +807,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -62,10 +62,23 @@ struct page {
 			struct {
 
 				union {
-					atomic_t _mapcount;	/* Count of ptes mapped in mms,
-							 * to show when page is mapped
-							 * & limit reverse map searches.
-							 */
+					/*
+					 * Count of ptes mapped in
+					 * mms, to show when page is
+					 * mapped & limit reverse map
+					 * searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
+					 */
+					atomic_t _mapcount;
 
 					struct {
 						unsigned inuse:16;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1164,11 +1165,27 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		/*
+		 * tail_page->_count is zero and not changing from
+		 * under us. But get_page_unless_zero() may be running
+		 * from under us on the tail_page. If we used
+		 * atomic_set() below instead of atomic_add(), we
+		 * would then run atomic_set() concurrently with
+		 * get_page_unless_zero(), and atomic_set() is
+		 * implemented in C not using locked ops. spin_unlock
+		 * on x86 sometime uses locked ops because of PPro
+		 * errata 66, 92, so unless somebody can guarantee
+		 * atomic_set() here would be safe on all archs (and
+		 * not only on x86), it's safer to use atomic_add().
+		 */
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1186,10 +1203,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1206,7 +1220,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
@@ -1223,6 +1236,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,47 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page,
+					bool get_page_head)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	if (get_page_head)
+		atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page, true);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1503,7 +1503,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,16 +103,17 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +143,40 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix #5
@ 2011-09-01 15:24                         ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-09-01 15:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michel Lespinasse, Andrew Morton, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Mon, Aug 29, 2011 at 01:20:26PM +0900, Minchan Kim wrote:
> I doubt someone might try to change this with atomic_set for reducing
> LOCK_PREFIX overhead in future although it's not fast path. Of course,
> we can prevent that patch but can't prevent his wasted time. I hope
> there is a comment why we use atomic_add like the errata PPro with
> OOStore.

Sure, good idea. I was waiting for Michel's ack before sending a further
update to add the comment. Because the code seems to have settled down
(it's running stable without the slightest problem on all my systems),
I think I can send a final version with the comment too.

This is the diff between #4 and #5.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1171,6 +1171,19 @@ static void __split_huge_page_refcount(s
 		/* check for overflow */
 		BUG_ON(tail_count < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		/*
+		 * tail_page->_count is zero and not changing from
+		 * under us. But get_page_unless_zero() may be running
+		 * from under us on the tail_page. If we used
+		 * atomic_set() below instead of atomic_add(), we
+		 * would then run atomic_set() concurrently with
+		 * get_page_unless_zero(), and atomic_set() is
+		 * implemented in C not using locked ops. spin_unlock
+		 * on x86 sometime uses locked ops because of PPro
+		 * errata 66, 92, so unless somebody can guarantee
+		 * atomic_set() here would be safe on all archs (and
+		 * not only on x86), it's safer to use atomic_add().
+		 */
 		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
 			   &page_tail->_count);
 
===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel, while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found that the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that use get_page_unless_zero() on SMP, if the
radix tree page is freed and reallocated and get_user_pages() is called on it
before page_cache_get_speculative() has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed that get_page()
is called by direct-io.c on pages returned by get_user_pages(). That wasn't
entirely safe because the two atomic_inc() in get_page() weren't atomic. Other
get_user_pages() users, like the secondary-MMU page fault path that establishes
the shadow pagetables, would never call any superfluous get_page() after
get_user_pages() returns. It's safer to make get_page() universally safe for
tail pages and to use get_page_foll() within follow_page() (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages
without taking any locks because it is run within PT lock protected critical
sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard
get_page() as invoked by direct-io instead will now take the compound_lock, but
still only for tail pages. The direct-io paths are usually I/O bound and the
compound_lock is per THP so very fine-grained, so there's no risk of
scalability issues with it. A simple direct-io benchmark with all the lockdep
prove-locking and spinlock debugging infrastructure enabled shows identical
performance and no overhead. So it's worth it. Ideally direct-io should stop
calling get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds, but doing get_page()
on tail pages returned by GUP is generally a rare operation and usually only
run in I/O paths.
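
To make the pattern concrete, here is a rough illustration of the double
reference being discussed (illustrative only, not the actual fs/direct-io.c
code; the function name is made up):

#include <linux/mm.h>
#include <linux/sched.h>

static int dio_pin_user_buffer(struct mm_struct *mm, unsigned long addr,
                               int nr_pages, struct page **pages)
{
        int i, got;

        /* get_user_pages() takes one reference per page via follow_page() */
        down_read(&mm->mmap_sem);
        got = get_user_pages(current, mm, addr, nr_pages, 1, 0, pages, NULL);
        up_read(&mm->mmap_sem);

        /*
         * The I/O path then takes a second reference with get_page(),
         * which may now run on a THP tail page and so goes through the
         * compound_lock slow path; it is dropped again at I/O completion.
         */
        for (i = 0; i < got; i++)
                get_page(pages[i]);

        return got;
}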

This new refcounting on page_tail->_mapcount, in addition to avoiding new RCU
critical sections, will also allow the working set estimation code to work
without any further complexity associated with the tail page refcounting
of THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,36 +356,39 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
+extern int __get_page_tail(struct page *page);
+
 static inline void get_page(struct page *page)
 {
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
 	/*
 	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
 	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page().
-		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
-	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -804,21 +807,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -62,10 +62,23 @@ struct page {
 			struct {
 
 				union {
-					atomic_t _mapcount;	/* Count of ptes mapped in mms,
-							 * to show when page is mapped
-							 * & limit reverse map searches.
-							 */
+					/*
+					 * Count of ptes mapped in
+					 * mms, to show when page is
+					 * mapped & limit reverse map
+					 * searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
+					 */
+					atomic_t _mapcount;
 
 					struct {
 						unsigned inuse:16;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1164,11 +1165,27 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		/*
+		 * tail_page->_count is zero and not changing from
+		 * under us. But get_page_unless_zero() may be running
+		 * from under us on the tail_page. If we used
+		 * atomic_set() below instead of atomic_add(), we
+		 * would then run atomic_set() concurrently with
+		 * get_page_unless_zero(), and atomic_set() is
+		 * implemented in C not using locked ops. spin_unlock
+		 * on x86 sometime uses locked ops because of PPro
+		 * errata 66, 92, so unless somebody can guarantee
+		 * atomic_set() here would be safe on all archs (and
+		 * not only on x86), it's safer to use atomic_add().
+		 */
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1186,10 +1203,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1206,7 +1220,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
@@ -1223,6 +1236,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,47 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page,
+					bool get_page_head)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	if (get_page_head)
+		atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page, true);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1503,7 +1503,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,21 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,16 +103,17 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +143,40 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	int got = 0;
+	struct page *page_head = compound_trans_head(page);
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #5
  2011-09-01 15:24                         ` Andrea Arcangeli
@ 2011-09-01 22:27                           ` Michel Lespinasse
  -1 siblings, 0 replies; 109+ messages in thread
From: Michel Lespinasse @ 2011-09-01 22:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Shaohua Li, Paul E. McKenney

On Thu, Sep 1, 2011 at 8:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Subject: thp: tail page refcounting fix
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Michel while working on the working set estimation code, noticed that calling
> get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
> pfn ended up being a tail page of a transparent hugepage under splitting by
> __split_huge_page_refcount(). He then found the problem could also
> theoretically materialize with page_cache_get_speculative() during the
> speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
> radix tree page is freed and reallocated and get_user_pages is called on it
> before page_cache_get_speculative has a chance to call get_page_unless_zero().
>
> So the best way to fix the problem is to keep page_tail->_count zero at all
> times. This will guarantee that get_page_unless_zero() can never succeed on any
> tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
> pages of a compound page, so we can simply account the tail page references
> there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
> addition to the head_page->_mapcount).
>
> While debugging this s/_count/_mapcount/ change I also noticed get_page is
> called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
> safe because the two atomic_inc in get_page weren't atomic. As opposed other
> get_user_page users like secondary-MMU page fault to establish the shadow
> pagetables would never call any superflous get_page after get_user_page
> returns. It's safer to make get_page universally safe for tail pages and to use
> get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
> is safe to do the refcounting for tail pages without taking any locks because
> it is run within PT lock protected critical sections (PT lock for pte and
> page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
> direct-io instead will now take the compound_lock but still only for tail
> pages. The direct-io paths are usually I/O bound and the compound_lock is per
> THP so very finegrined, so there's no risk of scalability issues with it. A
> simple direct-io benchmarks with all lockdep prove locking and spinlock
> debugging infrastructure enabled shows identical performance and no overhead.
> So it's worth it. Ideally direct-io should stop calling get_page() on pages
> returned by get_user_pages(). The spinlock in get_page() is already optimized
> away for no-THP builds but doing get_page() on tail pages returned by GUP is
> generally a rare operation and usually only run in I/O paths.
>
> This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
> critical sections will also allow the working set estimation code to work
> without any further complexity associated to the tail page refcounting
> with THP.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Michel Lespinasse <walken@google.com>
> Reviewed-by: Michel Lespinasse <walken@google.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Looks great. Thanks :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #5
  2011-09-01 15:24                         ` Andrea Arcangeli
@ 2011-09-01 23:28                           ` Andrew Morton
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrew Morton @ 2011-09-01 23:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Minchan Kim, Michel Lespinasse, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li, Paul E. McKenney, Andi Kleen

On Thu, 1 Sep 2011 17:24:17 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Ideally direct-io should stop calling get_page() on pages
> returned by get_user_pages().

Yeah.  get_user_pages() is sufficient.  Ideally we should be able to
undo the get_user_pages() get_page() from within the IO completion
interrupt and we're done.

Cc Andi, who is our resident dio tweaker ;)


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #5
  2011-09-01 23:28                           ` Andrew Morton
@ 2011-09-01 23:45                             ` Andi Kleen
  -1 siblings, 0 replies; 109+ messages in thread
From: Andi Kleen @ 2011-09-01 23:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Andi Kleen

On Thu, Sep 01, 2011 at 04:28:08PM -0700, Andrew Morton wrote:
> On Thu, 1 Sep 2011 17:24:17 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > Ideally direct-io should stop calling get_page() on pages
> > returned by get_user_pages().
> 
> Yeah.  get_user_pages() is sufficient.  Ideally we should be able to
> undo the get_user_pages() get_page() from within the IO completion
> interrupt and we're done.
> 
> Cc Andi, who is our resident dio tweaker ;)
> 

Noted, I'll put it on my list.

Should not be too difficult from a quick look, just the convoluted
nature of direct-io.c requires a lot of double checking.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #5
  2011-09-01 15:24                         ` Andrea Arcangeli
@ 2011-09-02  0:03                           ` Andrew Morton
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrew Morton @ 2011-09-02  0:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Minchan Kim, Michel Lespinasse, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Thu, 1 Sep 2011 17:24:17 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Michel while working on the working set estimation code, noticed that calling
> get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
> pfn ended up being a tail page of a transparent hugepage under splitting by
> __split_huge_page_refcount(). He then found the problem could also
> theoretically materialize with page_cache_get_speculative() during the
> speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
> radix tree page is freed and reallocated and get_user_pages is called on it
> before page_cache_get_speculative has a chance to call get_page_unless_zero().
> 
> So the best way to fix the problem is to keep page_tail->_count zero at all
> times. This will guarantee that get_page_unless_zero() can never succeed on any
> tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
> pages of a compound page, so we can simply account the tail page references
> there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
> addition to the head_page->_mapcount).
> 
> While debugging this s/_count/_mapcount/ change I also noticed get_page is
> called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
> safe because the two atomic_inc in get_page weren't atomic. As opposed other
> get_user_page users like secondary-MMU page fault to establish the shadow
> pagetables would never call any superflous get_page after get_user_page
> returns. It's safer to make get_page universally safe for tail pages and to use
> get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
> is safe to do the refcounting for tail pages without taking any locks because
> it is run within PT lock protected critical sections (PT lock for pte and
> page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
> direct-io instead will now take the compound_lock but still only for tail
> pages. The direct-io paths are usually I/O bound and the compound_lock is per
> THP so very finegrined, so there's no risk of scalability issues with it. A
> simple direct-io benchmarks with all lockdep prove locking and spinlock
> debugging infrastructure enabled shows identical performance and no overhead.
> So it's worth it. Ideally direct-io should stop calling get_page() on pages
> returned by get_user_pages(). The spinlock in get_page() is already optimized
> away for no-THP builds but doing get_page() on tail pages returned by GUP is
> generally a rare operation and usually only run in I/O paths.
> 
> This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
> critical sections will also allow the working set estimation code to work
> without any further complexity associated to the tail page refcounting
> with THP.
> 

The patch overall takes the x86_64 allmodconfig text size of
arch/x86/mm/gup.o, mm/huge_memory.o, mm/memory.o and mm/swap.o from a
total of 85059 bytes up to 85973.  That's quite a lot of bloat for a
pretty small patch.

I'm suspecting that most of this is due to the new inlined
get_page_foll(), which is large enough to squish an elephant.  Could
you please take a look at reducing this impact?

>
> ...
>
> +/*
> + * The atomic page->_mapcount, starts from -1: so that transitions
> + * both from it and to it can be tracked, using atomic_inc_and_test
> + * and atomic_add_negative(-1).
> + */
> +static inline void reset_page_mapcount(struct page *page)

I think we should have originally named this page_mapcount_reset().  This
is extra unimportant as it's a static symbol.

>
> ...
>
>  static inline void get_page(struct page *page)
>  {
> +	if (unlikely(PageTail(page)))
> +		if (likely(__get_page_tail(page)))
> +			return;

OK so we still have approximately one test-n-branch in the non-debug
get_page().

>  	/*
>  	 * Getting a normal page or the head of a compound page
> -	 * requires to already have an elevated page->_count. Only if
> -	 * we're getting a tail page, the elevated page->_count is
> -	 * required only in the head page, so for tail pages the
> -	 * bugcheck only verifies that the page->_count isn't
> -	 * negative.
> +	 * requires to already have an elevated page->_count.
>  	 */
> -	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> +	VM_BUG_ON(atomic_read(&page->_count) <= 0);

I wonder how many people enable VM_BUG_ON().  We're pretty profligate
with those things in hot paths.
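
(For reference, VM_BUG_ON() compiles away to nothing unless CONFIG_DEBUG_VM
is set -- roughly this, from include/linux/mmdebug.h:)

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
#define VM_BUG_ON(cond) do { } while (0)
#endif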

>  	atomic_inc(&page->_count);
> -	/*
> -	 * Getting a tail page will elevate both the head and tail
> -	 * page->_count(s).
> -	 */
> -	if (unlikely(PageTail(page))) {
> -		/*
> -		 * This is safe only because
> -		 * __split_huge_page_refcount can't run under
> -		 * get_page().
> -		 */
> -		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> -		atomic_inc(&page->first_page->_count);
> -	}
>  }
>  
>  static inline struct page *virt_to_head_page(const void *x)
>
> ...
>
> +int __get_page_tail(struct page *page)
> +{
> +	/*
> +	 * This takes care of get_page() if run on a tail page
> +	 * returned by one of the get_user_pages/follow_page variants.
> +	 * get_user_pages/follow_page itself doesn't need the compound
> +	 * lock because it runs __get_page_tail_foll() under the
> +	 * proper PT lock that already serializes against
> +	 * split_huge_page().
> +	 */
> +	unsigned long flags;
> +	int got = 0;

Could be a bool if you like that sort of thing..

> +	struct page *page_head = compound_trans_head(page);

Missing newline here

> +	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> +		/*
> +		 * page_head wasn't a dangling pointer but it
> +		 * may not be a head page anymore by the time
> +		 * we obtain the lock. That is ok as long as it
> +		 * can't be freed from under us.
> +		 */
> +		flags = compound_lock_irqsave(page_head);
> +		/* here __split_huge_page_refcount won't run anymore */
> +		if (likely(PageTail(page))) {
> +			__get_page_tail_foll(page, false);
> +			got = 1;
> +		}
> +		compound_unlock_irqrestore(page_head, flags);
> +		if (unlikely(!got))
> +			put_page(page_head);
> +	}
> +	return got;
> +}
> +EXPORT_SYMBOL(__get_page_tail);

Ordinarily I'd squeak about a global, exported-to-modules function
which is undocumented.  But this one is internal to get_page(), so it's
less necessary.

Still, documenting at least the return value (the "why" rather than the
"what") would make get_page() more understandable.

>
> ...
>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #5
  2011-09-01 23:45                             ` Andi Kleen
@ 2011-09-02  0:20                               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-09-02  0:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Fri, Sep 02, 2011 at 01:45:27AM +0200, Andi Kleen wrote:
> On Thu, Sep 01, 2011 at 04:28:08PM -0700, Andrew Morton wrote:
> > On Thu, 1 Sep 2011 17:24:17 +0200
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > > Ideally direct-io should stop calling get_page() on pages
> > > returned by get_user_pages().
> > 
> > Yeah.  get_user_pages() is sufficient.  Ideally we should be able to
> > undo the get_user_pages() get_page() from within the IO completion
> > interrupt and we're done.
> > 
> > Cc Andi, who is our resident dio tweaker ;)
>
> Noted, I'll put it on my list.

Thanks Andi!

> Should not be too difficult from a quick look, just the convoluted
> nature of direct-io.c requires a lot of double checking.

I also had a look but it wasn't trivial; I'm not even sure why
direct-io.c has to be so convoluted.

If we could optimize that, we would stay within get_page_foll() which
won't need to take the compound_lock even for tail pages (the
compound_lock can't be avoided for put_page() on tail pages because it
runs long after we release any VM lock).

Calling get_page/put_page more times than necessary is never ideal. I
imagine the biggest cost is the atomic_inc on the head page, which
brings in the cacheline of the head page in exclusive state; the
compound_lock in the second get_page shouldn't have a measurable
effect. So I think, from a practical perspective, it's not more
worthwhile to optimize that now than it already was before.

> Cc Andi, who is our resident dio tweaker ;)

Thanks :)

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #5
  2011-09-02  0:20                               ` Andrea Arcangeli
@ 2011-09-02  1:17                                 ` Andi Kleen
  -1 siblings, 0 replies; 109+ messages in thread
From: Andi Kleen @ 2011-09-02  1:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, Andrew Morton, Minchan Kim, Michel Lespinasse,
	linux-mm, linux-kernel, Hugh Dickins, Johannes Weiner,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Shaohua Li,
	Paul E. McKenney

> Calling get_page/put_pages more times than necessary is never ideal, I
> imagine the biggest cost is the atomic_inc on the head page that

I've actually seen it in profile logs, but I hadn't realized it was redundant.

Have to see if it brings a benefit to hot users.

-Andi

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH] thp: tail page refcounting fix #6
  2011-09-02  0:03                           ` Andrew Morton
@ 2011-09-08 16:51                             ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-09-08 16:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Michel Lespinasse, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Thu, Sep 01, 2011 at 05:03:53PM -0700, Andrew Morton wrote:
> The patch overall takes the x86_64 allmodconfig text size of
> arch/x86/mm/gup.o, mm/huge_memory.o, mm/memory.o and mm/swap.o from a
> total of 85059 bytes up to 85973.  That's quite a lot of bloat for a
> pretty small patch.
> 
> I'm suspecting that most of this is due to the new inlined
> get_page_foll(), which is large enough to squish an elephant.  Could
> you please take a look at reducing this impact?

It can't be get_page_foll(); it's called only twice... apparently the
change to get_page() is the cause of this.

The below is done with DEBUG_VM=n (we don't care that much about
DEBUG_VM, which shouldn't be used by any distro, and it's ok if it's a
bit larger) and without gcc -Os (i.e. CONFIG_CC_OPTIMIZE_FOR_SIZE=n,
which I think is the best setting; I don't trust -Os to be optimal).

before:

make-distcc -j32 arch/x86/mm/gup.o mm/huge_memory.o mm/memory.o; objdump -h -j .text mm/huge_memory.o mm/memory.o arch/x86/mm/gup.o|grep ' .text'| awk '{ print $3 }' | (x=0; while read i; do x=`expr $x + $[0x$i]`; done; echo $x)

39314

after:

39474

If I put get_page_foll out of line (i.e. moving it from mm/internal.h
to mm/swap.c):

39410

If I put get_page out of line and I keep get_page_foll in line:

39042

I tried to recraft get_page() in various ways but gcc always builds it
at 39474 bytes... and in fact it gets even worse if I remove the
unlikely (I even tried likely and a goto to jump from the slow path
back into the fast path to do the atomic_inc; it didn't change a single
byte).

So it's sad, but I'm afraid we'll have to live with the code bloat
unless we want to issue a "call+branch+lock incl+ret" instead of just a
"branch+lock incl".
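
For reference, the out-of-line variant that gave the 39042 figure would
look roughly like this (a sketch, not code that was actually posted):

/* include/linux/mm.h: only the declaration stays in the header */
extern void get_page(struct page *page);

/* mm/swap.c: the body moves out of line */
void get_page(struct page *page)
{
        if (unlikely(PageTail(page)) && likely(__get_page_tail(page)))
                return;
        /*
         * Getting a normal page or the head of a compound page
         * requires to already have an elevated page->_count.
         */
        VM_BUG_ON(atomic_read(&page->_count) <= 0);
        atomic_inc(&page->_count);
}
EXPORT_SYMBOL(get_page);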

Ideally the larger code size should be out of line, so I don't think
it's a problem other than for the wasted bytes of RAM and disk (but hey,
it expands all over the vmlinux exactly because it's get_page(), so it's
more than just a hundred bytes). It's certainly better to waste some
kbytes, regardless of the number of pages initialized by the kernel,
than to run slower.

> > +/*
> > + * The atomic page->_mapcount, starts from -1: so that transitions
> > + * both from it and to it can be tracked, using atomic_inc_and_test
> > + * and atomic_add_negative(-1).
> > + */
> > +static inline void reset_page_mapcount(struct page *page)
> 
> I think we should have originally named this page_mapcount_reset() This
> is extra unimportant as it's a static symbol.

Would you like me to make a patch to rename it? I guess it should be an
incremental patch or it'll do too many things at the same time.

> >  static inline void get_page(struct page *page)
> >  {
> > +	if (unlikely(PageTail(page)))
> > +		if (likely(__get_page_tail(page)))
> > +			return;
> 
> OK so we still have approximately one test-n-branch in the non-debug
> get_page().

Like before yes.

> >  	 * Getting a normal page or the head of a compound page
> > -	 * requires to already have an elevated page->_count. Only if
> > -	 * we're getting a tail page, the elevated page->_count is
> > -	 * required only in the head page, so for tail pages the
> > -	 * bugcheck only verifies that the page->_count isn't
> > -	 * negative.
> > +	 * requires to already have an elevated page->_count.
> >  	 */
> > -	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> > +	VM_BUG_ON(atomic_read(&page->_count) <= 0);
> 
> I wonder how many people enable VM_BUG_ON().  We're pretty profligate
> with those things in hot paths.

I think these are good to keep for now. They're VM_ because it's a fast
path, so it's ok. The other ones are in __split_huge_page_refcount(),
which is not a critical fast path, and those aren't VM_ because it's
worth checking that we don't get anything wrong there.

> > +int __get_page_tail(struct page *page)
> > +{
> > +	/*
> > +	 * This takes care of get_page() if run on a tail page
> > +	 * returned by one of the get_user_pages/follow_page variants.
> > +	 * get_user_pages/follow_page itself doesn't need the compound
> > +	 * lock because it runs __get_page_tail_foll() under the
> > +	 * proper PT lock that already serializes against
> > +	 * split_huge_page().
> > +	 */
> > +	unsigned long flags;
> > +	int got = 0;
> 
> Could be a bool if you like that sort of thing..

I like it. I just didn't think of using it.

> > +	struct page *page_head = compound_trans_head(page);
> 
> Missing newline here

Hmm ok.

> > +	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> > +		/*
> > +		 * page_head wasn't a dangling pointer but it
> > +		 * may not be a head page anymore by the time
> > +		 * we obtain the lock. That is ok as long as it
> > +		 * can't be freed from under us.
> > +		 */
> > +		flags = compound_lock_irqsave(page_head);
> > +		/* here __split_huge_page_refcount won't run anymore */
> > +		if (likely(PageTail(page))) {
> > +			__get_page_tail_foll(page, false);
> > +			got = 1;
> > +		}
> > +		compound_unlock_irqrestore(page_head, flags);
> > +		if (unlikely(!got))
> > +			put_page(page_head);
> > +	}
> > +	return got;
> > +}
> > +EXPORT_SYMBOL(__get_page_tail);
> 
> Ordinarily I'd squeak about a global, exported-to-modules function
> which is undocumented.  But this one is internal to get_page(), so it's
> less necessary.
> 
> Still, documenting at least the return value (the "why" rather than the
> "what") would make get_page() more understandable.

I think it's a good idea to document that it should never be called
directly and is meant to be called only by get_page.

Thanks a lot for the review. It's unfortunate that I didn't find a way
to shrink the .text without putting get_page out of line with this
logic; that's the reason for the delay in answering, but if somebody
has better ideas let me know...

Here is a new version with this incremental diff:

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -376,7 +376,7 @@ static inline int page_count(struct page
 	return atomic_read(&compound_head(page)->_count);
 }
 
-extern int __get_page_tail(struct page *page);
+extern bool __get_page_tail(struct page *page);
 
 static inline void get_page(struct page *page)
 {
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -59,6 +59,11 @@ static inline void __get_page_tail_foll(
 	atomic_inc(&page->_mapcount);
 }
 
+/*
+ * This is meant to be called as the FOLL_GET operation of
+ * follow_page() and it must be called while holding the proper PT
+ * lock while the pte (or pmd_trans_huge) is still mapping the page.
+ */
 static inline void get_page_foll(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,6 +79,7 @@ static void put_compound_page(struct pag
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = compound_trans_head(page);
+
 		if (likely(page != page_head &&
 			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
@@ -143,7 +144,11 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
-int __get_page_tail(struct page *page)
+/*
+ * This function is exported but must not be called by anything other
+ * than get_page(). It implements the slow path of get_page().
+ */
+bool __get_page_tail(struct page *page)
 {
 	/*
 	 * This takes care of get_page() if run on a tail page
@@ -154,8 +159,9 @@ int __get_page_tail(struct page *page)
 	 * split_huge_page().
 	 */
 	unsigned long flags;
-	int got = 0;
+	bool got = false;
 	struct page *page_head = compound_trans_head(page);
+
 	if (likely(page != page_head && get_page_unless_zero(page_head))) {
 		/*
 		 * page_head wasn't a dangling pointer but it
@@ -167,7 +173,7 @@ int __get_page_tail(struct page *page)
 		/* here __split_huge_page_refcount won't run anymore */
 		if (likely(PageTail(page))) {
 			__get_page_tail_foll(page, false);
-			got = 1;
+			got = true;
 		}
 		compound_unlock_irqrestore(page_head, flags);
 		if (unlikely(!got))

===
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <aarcange@redhat.com>

Michel, while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount(). He then found
that the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that use get_page_unless_zero() on SMP, if the radix tree page is freed
and reallocated and get_user_pages() is called on it before
page_cache_get_speculative() has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).
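
Condensed to its essence (this is just a restatement of the
__split_huge_page_refcount() hunk further below, with head, tail and
tail_count used as shorthand names, not literal code), the accounting
becomes:

	/* while the THP is intact, gup references on a tail page go here */
	atomic_inc(&tail->_mapcount);		/* tail->_count stays zero */
	atomic_inc(&head->_count);

	/* at split time, per tail page, move them over together with
	 * the mapping references */
	atomic_add(page_mapcount(head) + page_mapcount(tail) + 1,
		   &tail->_count);
	/* and finally drop the accumulated gup references from the head */
	atomic_sub(tail_count, &head->_count);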

While debugging this s/_count/_mapcount/ change I also noticed that
get_page is called by direct-io.c on pages returned by get_user_pages.
That wasn't entirely safe because the two atomic_inc in get_page weren't
atomic. Other get_user_pages users, like the secondary-MMU page fault
path that establishes the shadow pagetables, would never call any
superfluous get_page after get_user_pages returns. It's safer to make
get_page universally safe for tail pages and to use get_page_foll()
within follow_page (inside get_user_pages()). get_page_foll() is safe to
do the refcounting for tail pages without taking any locks because it
runs within PT-lock-protected critical sections (the PT lock for ptes
and page_table_lock for pmd_trans_huge). The standard get_page(), as
invoked by direct-io, will instead take the compound_lock, but still
only for tail pages. The direct-io paths are usually I/O bound and the
compound_lock is per THP, so it is very fine-grained and there's no risk
of scalability issues with it. A simple direct-io benchmark with all the
lockdep prove-locking and spinlock debugging infrastructure enabled
shows identical performance and no overhead, so it's worth it. Ideally
direct-io should stop calling get_page() on pages returned by
get_user_pages(). The spinlock in get_page() is already optimized away
for no-THP builds, and doing get_page() on tail pages returned by GUP is
generally a rare operation, usually only run in I/O paths.

This new refcounting on page_tail->_mapcount, in addition to avoiding
new RCU critical sections, will also allow the working set estimation
code to work without any further complexity associated with tail page
refcounting under THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 /*
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -114,8 +114,9 @@ static inline void get_huge_page_tail(st
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,36 +356,39 @@ static inline struct page *compound_head
 	return page;
 }
 
+/*
+ * The atomic page->_mapcount, starts from -1: so that transitions
+ * both from it and to it can be tracked, using atomic_inc_and_test
+ * and atomic_add_negative(-1).
+ */
+static inline void reset_page_mapcount(struct page *page)
+{
+	atomic_set(&(page)->_mapcount, -1);
+}
+
+static inline int page_mapcount(struct page *page)
+{
+	return atomic_read(&(page)->_mapcount) + 1;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
 }
 
+extern bool __get_page_tail(struct page *page);
+
 static inline void get_page(struct page *page)
 {
+	if (unlikely(PageTail(page)))
+		if (likely(__get_page_tail(page)))
+			return;
 	/*
 	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count. Only if
-	 * we're getting a tail page, the elevated page->_count is
-	 * required only in the head page, so for tail pages the
-	 * bugcheck only verifies that the page->_count isn't
-	 * negative.
+	 * requires to already have an elevated page->_count.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
+	VM_BUG_ON(atomic_read(&page->_count) <= 0);
 	atomic_inc(&page->_count);
-	/*
-	 * Getting a tail page will elevate both the head and tail
-	 * page->_count(s).
-	 */
-	if (unlikely(PageTail(page))) {
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount can't run under
-		 * get_page().
-		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
-	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -804,21 +807,6 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
- * The atomic page->_mapcount, like _count, starts from -1:
- * so that transitions both from it and to it can be tracked,
- * using atomic_inc_and_test and atomic_add_negative(-1).
- */
-static inline void reset_page_mapcount(struct page *page)
-{
-	atomic_set(&(page)->_mapcount, -1);
-}
-
-static inline int page_mapcount(struct page *page)
-{
-	return atomic_read(&(page)->_mapcount) + 1;
-}
-
-/*
  * Return true if this page is mapped into pagetables.
  */
 static inline int page_mapped(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -62,10 +62,23 @@ struct page {
 			struct {
 
 				union {
-					atomic_t _mapcount;	/* Count of ptes mapped in mms,
-							 * to show when page is mapped
-							 * & limit reverse map searches.
-							 */
+					/*
+					 * Count of ptes mapped in
+					 * mms, to show when page is
+					 * mapped & limit reverse map
+					 * searches.
+					 *
+					 * Used also for tail pages
+					 * refcounting instead of
+					 * _count. Tail pages cannot
+					 * be mapped and keeping the
+					 * tail page _count zero at
+					 * all times guarantees
+					 * get_page_unless_zero() will
+					 * never succeed on tail
+					 * pages.
+					 */
+					atomic_t _mapcount;
 
 					struct {
 						unsigned inuse:16;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -989,7 +989,7 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON(!PageCompound(page));
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 
 out:
 	return page;
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(s
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1164,11 +1165,27 @@ static void __split_huge_page_refcount(s
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		/* tail_page->_mapcount cannot change */
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_count += page_mapcount(page_tail);
+		/* check for overflow */
+		BUG_ON(tail_count < 0);
+		BUG_ON(atomic_read(&page_tail->_count) != 0);
+		/*
+		 * tail_page->_count is zero and not changing from
+		 * under us. But get_page_unless_zero() may be running
+		 * from under us on the tail_page. If we used
+		 * atomic_set() below instead of atomic_add(), we
+		 * would then run atomic_set() concurrently with
+		 * get_page_unless_zero(), and atomic_set() is
+		 * implemented in C not using locked ops. spin_unlock
+		 * on x86 sometime uses locked ops because of PPro
+		 * errata 66, 92, so unless somebody can guarantee
+		 * atomic_set() here would be safe on all archs (and
+		 * not only on x86), it's safer to use atomic_add().
+		 */
+		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
+			   &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1186,10 +1203,7 @@ static void __split_huge_page_refcount(s
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
+		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
 		/*
@@ -1206,7 +1220,6 @@ static void __split_huge_page_refcount(s
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		BUG_ON(page_mapcount(page_tail));
 		page_tail->_mapcount = page->_mapcount;
 
 		BUG_ON(page_tail->mapping);
@@ -1223,6 +1236,8 @@ static void __split_huge_page_refcount(s
 
 		lru_add_page_tail(zone, page, page_tail);
 	}
+	atomic_sub(tail_count, &page->_count);
+	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,52 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+static inline void __get_page_tail_foll(struct page *page,
+					bool get_page_head)
+{
+	/*
+	 * If we're getting a tail page, the elevated page->_count is
+	 * required only in the head page and we will elevate the head
+	 * page->_count and tail page->_mapcount.
+	 *
+	 * We elevate page_tail->_mapcount for tail pages to force
+	 * page_tail->_count to be zero at all times to avoid getting
+	 * false positives from get_page_unless_zero() with
+	 * speculative page access (like in
+	 * page_cache_get_speculative()) on tail pages.
+	 */
+	VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	VM_BUG_ON(page_mapcount(page) < 0);
+	if (get_page_head)
+		atomic_inc(&page->first_page->_count);
+	atomic_inc(&page->_mapcount);
+}
+
+/*
+ * This is meant to be called as the FOLL_GET operation of
+ * follow_page() and it must be called while holding the proper PT
+ * lock while the pte (or pmd_trans_huge) is still mapping the page.
+ */
+static inline void get_page_foll(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount() can't run under
+		 * get_page_foll() because we hold the proper PT lock.
+		 */
+		__get_page_tail_foll(page, true);
+	else {
+		/*
+		 * Getting a normal page or the head of a compound page
+		 * requires to already have an elevated page->_count.
+		 */
+		VM_BUG_ON(atomic_read(&page->_count) <= 0);
+		atomic_inc(&page->_count);
+	}
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1503,7 +1503,7 @@ split_fallthrough:
 	}
 
 	if (flags & FOLL_GET)
-		get_page(page);
+		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -78,39 +78,22 @@ static void put_compound_page(struct pag
 {
 	if (unlikely(PageTail(page))) {
 		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
-		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
-		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+		struct page *page_head = compound_trans_head(page);
+
+		if (likely(page != page_head &&
+			   get_page_unless_zero(page_head))) {
 			unsigned long flags;
 			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
+			 * page_head wasn't a dangling pointer but it
+			 * may not be a head page anymore by the time
+			 * we obtain the lock. That is ok as long as it
+			 * can't be freed from under us.
 			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
 			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);
 				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
 				if (put_page_testzero(page_head))
 					__put_single_page(page_head);
 			out_put_single:
@@ -121,16 +104,17 @@ static void put_compound_page(struct pag
 			VM_BUG_ON(page_head != page->first_page);
 			/*
 			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
+			 * get_page_unless_zero() now that
+			 * __split_huge_page_refcount() is blocked on
+			 * the compound_lock.
 			 */
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
+			VM_BUG_ON(page_mapcount(page) <= 0);
+			atomic_dec(&page->_mapcount);
 			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(atomic_read(&page->_count) != 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
@@ -160,6 +144,45 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+/*
+ * This function is exported but must not be called by anything other
+ * than get_page(). It implements the slow path of get_page().
+ */
+bool __get_page_tail(struct page *page)
+{
+	/*
+	 * This takes care of get_page() if run on a tail page
+	 * returned by one of the get_user_pages/follow_page variants.
+	 * get_user_pages/follow_page itself doesn't need the compound
+	 * lock because it runs __get_page_tail_foll() under the
+	 * proper PT lock that already serializes against
+	 * split_huge_page().
+	 */
+	unsigned long flags;
+	bool got = false;
+	struct page *page_head = compound_trans_head(page);
+
+	if (likely(page != page_head && get_page_unless_zero(page_head))) {
+		/*
+		 * page_head wasn't a dangling pointer but it
+		 * may not be a head page anymore by the time
+		 * we obtain the lock. That is ok as long as it
+		 * can't be freed from under us.
+		 */
+		flags = compound_lock_irqsave(page_head);
+		/* here __split_huge_page_refcount won't run anymore */
+		if (likely(PageTail(page))) {
+			__get_page_tail_foll(page, false);
+			got = true;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+		if (unlikely(!got))
+			put_page(page_head);
+	}
+	return got;
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru

^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix #6
  2011-09-08 16:51                             ` Andrea Arcangeli
@ 2011-09-23 15:57                               ` Peter Zijlstra
  -1 siblings, 0 replies; 109+ messages in thread
From: Peter Zijlstra @ 2011-09-23 15:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney

On Thu, 2011-09-08 at 18:51 +0200, Andrea Arcangeli wrote:

> +++ b/arch/powerpc/mm/gup.c
> +++ b/arch/x86/mm/gup.c

lacking a diffstat a quick look seems to suggest you missed a few:

$ ls arch/*/mm/gup.c
arch/powerpc/mm/gup.c  
arch/s390/mm/gup.c  
arch/sh/mm/gup.c  
arch/sparc/mm/gup.c  
arch/x86/mm/gup.c



^ permalink raw reply	[flat|nested] 109+ messages in thread


* Re: [PATCH] thp: tail page refcounting fix #6
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-09-30 13:58                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-09-30 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi everyone,

On Fri, Sep 23, 2011 at 05:57:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-08 at 18:51 +0200, Andrea Arcangeli wrote:
> 
> > +++ b/arch/powerpc/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> 
> lacking a diffstat a quick look seems to suggest you missed a few:
> 
> $ ls arch/*/mm/gup.c
> arch/powerpc/mm/gup.c  
> arch/s390/mm/gup.c  
> arch/sh/mm/gup.c  
> arch/sparc/mm/gup.c  
> arch/x86/mm/gup.c

sh should already be ok, and sparc too, as they don't seem to support
hugetlbfs or THP. s390 32bit is probably fine too. sh actually has some
optional HUGETLB support, but I don't see it handled in its gup...

But I think I missed s390 64bit! And after a closer look, surprisingly,
they weren't ok before this change either! Not even ppc.

In fact they will work better with this change applied, because a
page_cache_get_speculative succeeding on a tail page (as could have
happened before) would have resulted in a VM_BUG_ON before returning
from page_cache_get_speculative. I guess not many people are testing
O_DIRECT on hugetlbfs on ppc64 or s390 64bit. The PageTail check in
powerpc/mm/gup.c suddenly looks like a noop. Weird.

Also, while doing this I found longstanding bugs in powerpc: nr also
includes pages found before gup_hugepte is called, so if that race
triggers (mremap or munmap changing the huge pagetable under gup_fast,
which can definitely happen) put_page ends up being called too many
times on the hugetlbfs page.

Also, in the code below they probably intended to use "head", not
"page"... "page" is out of bounds, so the rollback in the race case was
going to free a random page even in older kernels.

       if (unlikely(pte_val(pte) != pte_val(*ptep))) {
               /* Could be optimized better */
               while (*nr) {
                       put_page(page);
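
A sketch of what that rollback presumably should look like (nr_start is
a hypothetical name for the value *nr had on entry to gup_hugepte; the
real fix also has to deal with the tail refcounting once THP tail
pinning is in):

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		/* undo only the references taken by this call, and drop
		 * them on the compound head that was actually pinned */
		while (*nr > nr_start) {
			put_page(head);
			(*nr)--;
		}
		return 0;
	}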

But I'm not sure why ppc always checks whether the pte changed and
tries to roll back. There's no need for that: if gup_fast could run, it
means we're not within an mmu_notifier_invalidate_range_start/end
critical section, and mmu_notifier_invalidate_page won't run until
after the tlb flush has returned (so after gup_fast returned on the
other cpu), so the pages are guaranteed to always have page_count >= 0
(munmap/mremap will stop and wait in
mmu_notifier_invalidate_range_start or in mmu_notifier_invalidate_page,
where the IPI delivery will wait before releasing the page count, in
order to flush any secondary mapping before the primary mapping is
allowed to free the page).

In short there are two fishy things in the ppc code (in addition to
the race-rollback corrupting memory, which is just a minor
implementation bug that is trivial to correct; besides, after THP we
have to put_page the right subpage, not just the head, so it must be
refactored anyway):

1) page_cache_get/add_speculative is not needed anywhere in the
gup_fast path of ppc as long as irqs are disabled (gup_fast simulates
the cpu doing a tlb miss). Special care has to be taken to read the
ptep into a local variable (stored in a cpu register or on the kernel
stack) and then to evaluate only that local pte_t, never reading
through the pointer again. Then you do pte_page on the _local_ pte_t
variable, and you know you can get_page without doing any
get_page_unless_zero (plus there's a VM_BUG_ON in get_page to verify
we're not running get_page on a page with a zero count).

Hmmm, point 1) above is also a self-reminder that maybe even x86
should always use ACCESS_ONCE, even for the pmds (it already does for
the pte, or else it has an smp_rmb() after it finishes reading it; it's
probably not required for the upper layers as they can't be torn down
until the IPI runs). There's no way gcc is stupid enough to re-read
from the pointer, but in theory it could without barrier(). This isn't
just for THP but for hugetlbfs too; it's only theoretical though.
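
Roughly, point 1) amounts to something like this (only a sketch, with
the surrounding checks stripped down to a bare pte_present test):

	pte_t pte = ACCESS_ONCE(*ptep);	/* read the pte exactly once */
	struct page *page;

	if (!pte_present(pte))
		return 0;
	/* from here on only the local copy is evaluated, never *ptep */
	page = pte_page(pte);
	/*
	 * irqs are disabled, so the tlb flush (and hence the freeing of
	 * the page) cannot complete from under us: a plain get_page()
	 * is enough, no get_page_unless_zero() needed.
	 */
	get_page(page);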

2) the above pte_val(local_pte_t) != pte_val(*ptep) checks are in
theory useless for ppc: as long as you verify the pte is ok (so you
finished reading it before mremap/munmap changed it), the page can't
go away under you because the tlb flush will wait.

I haven't started checking s390 in detail yet, but it looks close to
the ppc code. Let's sort out ppc first.

Now I'd like to know whether the soft tlb miss handler of powerpc
changes anything and really requires the code commented on in points
1) and 2) (even if that code and those race checks are not required on
x86, for the reason I just mentioned...).

I can make a patch that keeps these page_cache_get/add_speculative
calls and the pte_val(local_pte_t) != pte_val(*ptep) checks intact
(they're superfluous, but obviously they can't hurt). So I could make
a patch that works with the current code, but if I could safely delete
those I'd prefer to, as it's quite confusing why they're needed. Before
I do that, though, I need an ack from the ppc people to be sure it's
ok... I don't know the assembly of the tlb miss hashtable handler well
enough to be sure, and I can't exclude that ppc has different ipi/irq
semantics for the gup_fast case and really requires those checks.

CC'ed Benjamin :)

Comments welcome!
Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH] thp: tail page refcounting fix #6
@ 2011-09-30 13:58                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-09-30 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi everyone,

On Fri, Sep 23, 2011 at 05:57:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-08 at 18:51 +0200, Andrea Arcangeli wrote:
> 
> > +++ b/arch/powerpc/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> 
> lacking a diffstat a quick look seems to suggest you missed a few:
> 
> $ ls arch/*/mm/gup.c
> arch/powerpc/mm/gup.c  
> arch/s390/mm/gup.c  
> arch/sh/mm/gup.c  
> arch/sparc/mm/gup.c  
> arch/x86/mm/gup.c

sh should be already ok, sparc too as they don't seem to support
hugetlbfs or THP. s390 32bit probably too. sh actually has some
HUGETLB optional thing but I don't see it handling it in gup...

But I think I missed s39064bit! But after a closer look surprisingly
they weren't ok before this change too! Even ppc.

In fact they will work better with this change applied because
succeeding a page_cache_get_speculative on a TailPage (like it could
have happened before) would have resulted in a VM_BUG_ON before
returning from page_cache_get_speculative. I guess not many are
testing O_DIRECT on hugetlbfs on ppc64,s39064bit. The PageTail check
in powerpc/mm/gup.c suddenly looks like a noop. Weird.

Also while doing this I found longstanding bugs in powerpc. nr
includes also pages found before calling gup_hugepte. So if that race
triggers (mremap,munmap changing the huge pagetable under gup_fast,
which can definitely happen) it'd lead to put_page being called too
many times on the hugetlbfs page.

Also in the below code probably they intended to use "head", not
"page".... "page" is out of bounds... so the rollback in the race case
was going to free a random page even in older kernels.

       if (unlikely(pte_val(pte) != pte_val(*ptep))) {
               /* Could be optimized better */
               while (*nr) {
                       put_page(page);

But I'm not sure why ppc always checks if the pte changed and tries to
rollback. There's no need of that, if gup_fast could run it means
we're not within a mmu_notifier_invalidate_range_start/end critical
section, and mmu_notifier_invalidate_page won't run until after the
tlb flush returned (so after gup_fast returned on the other cpu), and
so the pages are guaranteed to always have page_count >= 0
(munmap/mremap will stop and wait in
mmu_notifier_invalidate_range_start or in mmu_notifier_invalidate_page
where the IPI delivery will wait before releasing the page count, to
flush any secondary mapping before the primary mapping is allowed to
free the page).

In short there are two fishy things in ppc code (in addition to the
race-rollback corrupting memory which is just a minor implementation
bug trivial to correct, besides after THP we've to put_page the right
subpage not just the head so it must be refactored anyway):

1) page_cache_get/add_speculative is not needed anywhere in gup_fast
path of ppc as long as irqs are disabled (it simulates the cpu doing a tlb
miss). Special care has to be taken so you read the ptep to a local
variable (stored in cpu register or kernel stack), and then you
evaluate the local pte_t (stopping reading from the pointer). Then you
do pte_page on the _local_ pte_t variable, and you know you can
get_page without doing any get_page_unless_zero (plus there's a
VM_BUG_ON in get_page to verify we're not running get_page on a page
with a zero count).

Hmmm explanation of point 1) above self-remind that maybe even x86
should always use ACCESS_ONCE, even for the pmds (it already does for
the pte, either that or it has a smp_rmb() after finish reading it),
(probably not required for upper layers as they can't be tear down
until the IPI runs), now there's no way gcc is stupid enough to
re-read from pointer but in theory it could without barrier(). This
isn't just for THP but for hugetlbfs too, it's only theoretical though.

2) the above pte_val(local_pte_t) != pte(*ptep) checks in theory as
useless for ppc, as long as you verify the pte is ok (so you arrived
reading before mremap/munmap changed the pte), the page can't go away
under you because the tlb flush will wait.

I haven't started checking s390 in detail yet, but it looks close to
the ppc code. Let's sort out ppc first.

Now I'd like to know if the soft tlb miss handler of powerpc changes
anything and really requires the code discussed in points 1) and 2)
(even if that code and those race checks are not required on x86, for
the reason I just mentioned...).

I can make a patch that keeps these page_cache_get/add_speculative
calls and the pte_val(local_pte_t) != pte_val(*ptep) checks intact
(they're superfluous but obviously they can't hurt). So I could make a
patch that works with the current code, but if I could safely delete
those I'd prefer to, as it's quite confusing why they're needed. Before
I do that I need an ack from the ppc people to be sure it's ok... I
don't know the assembly of the tlb miss hashtable handler well enough
to be sure, and I can't exclude that ppc has different ipi/irq
semantics for the gup_fast case that really require those checks.

CC'ed Benjamin :)

Comments welcome!
Andrea

^ permalink raw reply	[flat|nested] 109+ messages in thread

* thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6]
  2011-09-23 15:57                               ` Peter Zijlstra
  (?)
  (?)
@ 2011-10-16 20:37                               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi everyone,

so I reviewed the ppc gup_fast hugetlbfs code a bit, fixed the
longstanding memory corrupting bugs (they could trigger if mremap/munmap
run under gup_fast) and fixed the code that was supposed to make it
work with the thp introduction in 2.6.38 and, more recently, with the
tail refcounting race fixes in -mm. This is incremental with the thp
refcounting race fixes merged in -mm.

To me the rolling back that ppc does if the pte changed looks
unnecessary, and the speculative access also looks unnecessary (there is
no way the page_count of the head or of regular pages can be zero
there). x86 doesn't do any speculative refcounting and doesn't care
if the pte changed (we know the page can't go away from under us
because irqs are disabled). If the tlb flushing code works on ppc like
on x86 there should be no need for any of that.

However I didn't remove those two rollback conditions; in theory keeping
them shouldn't hurt (well, not anymore, after fixing the two corrupting
bugs...). I just tried to make the minimal changes required, because I
didn't test this. It'd be nice if ppc users could test it with O_DIRECT
on top of hugetlbfs and report whether it works. I did build-test it
though, so it should at least build just fine.

s390x should be the only other arch that needs revisiting to make
gup_fast + hugetlbfs work properly. I'll do that next.

[PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast
[PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page
[PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times
[PATCH 4/4] powerpc: gup_hugepte() support THP based tail refcounting

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast
  2011-09-23 15:57                               ` Peter Zijlstra
                                                 ` (2 preceding siblings ...)
  (?)
@ 2011-10-16 20:37                               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

This part of gup_fast doesn't seem capable of handling hugetlbfs ptes;
those should be handled by gup_hugepd only, so these checks are
superfluous.

Plus, if this wasn't a noop, it would have oopsed, because the insistence
on using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in page_cache_get_speculative().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index b9e1c7f..d7efdbf 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,17 +16,6 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -58,8 +47,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(page);
 			return 0;
 		}
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page
  2011-09-23 15:57                               ` Peter Zijlstra
                                                 ` (3 preceding siblings ...)
  (?)
@ 2011-10-16 20:37                               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.

This is a longstanding issue pre-thp inclusion.

It's totally unclear how these page_cache_add_speculative and
pte_val(pte) != pte_val(*ptep) checks are necessary across all the
powerpc gup_fast code, when x86 doesn't need any of that: there's no way
the page can be freed with irq disabled so we're guaranteed the
atomic_inc will happen on a page with page_count > 0 (so not needing the
speculative check). The pte check is also meaningless on x86: no need to
rollback on x86 if the pte changed, because the pte can still change a
CPU tick after the check succeeded and it won't be rolled back in that
case. The important thing is we got a reference on a valid page that was
mapped there a CPU tick ago. So not knowing the soft tlb refill code of
ppc64 in great detail I'm not removing the "speculative" page_count
increase and the pte checks across all the code, but unless there's a
strong reason for it they should be later cleaned up too.

If a pte can change from huge to non-huge (like it could happen with
THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
only so I'm not altering that.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0b9a5c1..b649c28 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -429,7 +429,7 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		/* Could be optimized better */
 		while (*nr) {
-			put_page(page);
+			put_page(head);
 			(*nr)--;
 		}
 	}
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times
  2011-09-23 15:57                               ` Peter Zijlstra
                                                 ` (4 preceding siblings ...)
  (?)
@ 2011-10-16 20:37                               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

We only taken "refs" pins on the head page not "*nr" pins.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b649c28..78b14ab 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -428,10 +428,9 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		/* Could be optimized better */
-		while (*nr) {
+		*nr -= refs;
+		while (refs--)
 			put_page(head);
-			(*nr)--;
-		}
 	}
 
 	return 1;
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 4/4] powerpc: gup_hugepte() support THP based tail refcounting
  2011-09-23 15:57                               ` Peter Zijlstra
                                                 ` (5 preceding siblings ...)
  (?)
@ 2011-10-16 20:37                               ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail page
refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 78b14ab..a618ef0 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -385,12 +385,23 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		       unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page;
+	struct page *head, *page, *tail;
 	pte_t pte;
 	int refs;
 
@@ -413,6 +424,7 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -431,6 +443,16 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
+	} else {
+		/*
+		 * Any tail page need their mapcount reference taken
+		 * before we return.
+		 */
+		while (refs--) {
+			if (PageTail(tail))
+				get_huge_page_tail(tail);
+			tail++;
+		}
 	}
 
 	return 1;
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6]
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi everyone,

so I reviewed the ppc gup_fast hugetlbfs code a bit, fixed the
longstanding memory corrupting bugs (they could trigger if mremap/munmap
run under gup_fast) and fixed the code that was supposed to make it
work with the thp introduction in 2.6.38 and, more recently, with the
tail refcounting race fixes in -mm. This is incremental with the thp
refcounting race fixes merged in -mm.

To me the rolling back that ppc does if the pte changed looks
unnecessary, and the speculative access also looks unnecessary (there is
no way the page_count of the head or of regular pages can be zero
there). x86 doesn't do any speculative refcounting and doesn't care
if the pte changed (we know the page can't go away from under us
because irqs are disabled). If the tlb flushing code works on ppc like
on x86 there should be no need for any of that.

However I didn't remove those two rollback conditions; in theory keeping
them shouldn't hurt (well, not anymore, after fixing the two corrupting
bugs...). I just tried to make the minimal changes required, because I
didn't test this. It'd be nice if ppc users could test it with O_DIRECT
on top of hugetlbfs and report whether it works. I did build-test it
though, so it should at least build just fine.

s390x should be the only other arch that needs revisiting to make
gup_fast + hugetlbfs work properly. I'll do that next.

[PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast
[PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page
[PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times
[PATCH 4/4] powerpc: gup_hugepte() support THP based tail refcounting

^ permalink raw reply	[flat|nested] 109+ messages in thread

* thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6]
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi everyone,

so I reviewed the ppc gup_fast hugetlbfs code a bit, fixed the
longstanding memory corrupting bugs (they could trigger if mremap/munmap
run under gup_fast) and fixed the code that was supposed to make it
work with the thp introduction in 2.6.38 and, more recently, with the
tail refcounting race fixes in -mm. This is incremental with the thp
refcounting race fixes merged in -mm.

To me the rolling back that ppc does if the pte changed looks
unnecessary, and the speculative access also looks unnecessary (there is
no way the page_count of the head or of regular pages can be zero
there). x86 doesn't do any speculative refcounting and doesn't care
if the pte changed (we know the page can't go away from under us
because irqs are disabled). If the tlb flushing code works on ppc like
on x86 there should be no need for any of that.

However I didn't remove those two rollback conditions; in theory keeping
them shouldn't hurt (well, not anymore, after fixing the two corrupting
bugs...). I just tried to make the minimal changes required, because I
didn't test this. It'd be nice if ppc users could test it with O_DIRECT
on top of hugetlbfs and report whether it works. I did build-test it
though, so it should at least build just fine.

s390x should be the only other arch that needs revisiting to make
gup_fast + hugetlbfs work properly. I'll do that next.

[PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast
[PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page
[PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times
[PATCH 4/4] powerpc: gup_hugepte() support THP based tail refcounting

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

This part of gup_fast doesn't seem capable of handling hugetlbfs ptes;
those should be handled by gup_hugepd only, so these checks are
superfluous.

Plus, if this wasn't a noop, it would have oopsed, because the insistence
on using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in page_cache_get_speculative().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index b9e1c7f..d7efdbf 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,17 +16,6 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -58,8 +47,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(page);
 			return 0;
 		}
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

This part of gup_fast doesn't seem capable of handling hugetlbfs ptes;
those should be handled by gup_hugepd only, so these checks are
superfluous.

Plus, if this wasn't a noop, it would have oopsed, because the insistence
on using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in page_cache_get_speculative().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index b9e1c7f..d7efdbf 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,17 +16,6 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -58,8 +47,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(page);
 			return 0;
 		}
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.

This is a longstanding issue pre-thp inclusion.

It's totally unclear how these page_cache_add_speculative and
pte_val(pte) != pte_val(*ptep) checks are necessary across all the
powerpc gup_fast code, when x86 doesn't need any of that: there's no way
the page can be freed with irq disabled so we're guaranteed the
atomic_inc will happen on a page with page_count > 0 (so not needing the
speculative check). The pte check is also meaningless on x86: no need to
rollback on x86 if the pte changed, because the pte can still change a
CPU tick after the check succeeded and it won't be rolled back in that
case. The important thing is we got a reference on a valid page that was
mapped there a CPU tick ago. So not knowing the soft tlb refill code of
ppc64 in great detail I'm not removing the "speculative" page_count
increase and the pte checks across all the code, but unless there's a
strong reason for it they should be later cleaned up too.

If a pte can change from huge to non-huge (like it could happen with
THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
only so I'm not altering that.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0b9a5c1..b649c28 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -429,7 +429,7 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		/* Could be optimized better */
 		while (*nr) {
-			put_page(page);
+			put_page(head);
 			(*nr)--;
 		}
 	}
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.

This is a longstanding issue pre-thp inclusion.

It's totally unclear how these page_cache_add_speculative and
pte_val(pte) != pte_val(*ptep) checks are necessary across all the
powerpc gup_fast code, when x86 doesn't need any of that: there's no way
the page can be freed with irq disabled so we're guaranteed the
atomic_inc will happen on a page with page_count > 0 (so not needing the
speculative check). The pte check is also meaningless on x86: no need to
rollback on x86 if the pte changed, because the pte can still change a
CPU tick after the check succeeded and it won't be rolled back in that
case. The important thing is we got a reference on a valid page that was
mapped there a CPU tick ago. So not knowing the soft tlb refill code of
ppc64 in great detail I'm not removing the "speculative" page_count
increase and the pte checks across all the code, but unless there's a
strong reason for it they should be later cleaned up too.

If a pte can change from huge to non-huge (like it could happen with
THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
only so I'm not altering that.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0b9a5c1..b649c28 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -429,7 +429,7 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		/* Could be optimized better */
 		while (*nr) {
-			put_page(page);
+			put_page(head);
 			(*nr)--;
 		}
 	}
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

We only taken "refs" pins on the head page not "*nr" pins.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b649c28..78b14ab 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -428,10 +428,9 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		/* Could be optimized better */
-		while (*nr) {
+		*nr -= refs;
+		while (refs--)
 			put_page(head);
-			(*nr)--;
-		}
 	}
 
 	return 1;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

We only taken "refs" pins on the head page not "*nr" pins.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b649c28..78b14ab 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -428,10 +428,9 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		/* Could be optimized better */
-		while (*nr) {
+		*nr -= refs;
+		while (refs--)
 			put_page(head);
-			(*nr)--;
-		}
 	}
 
 	return 1;
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 4/4] powerpc: gup_hugepte() support THP based tail refcounting
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail page
refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 78b14ab..a618ef0 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -385,12 +385,23 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		       unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page;
+	struct page *head, *page, *tail;
 	pte_t pte;
 	int refs;
 
@@ -413,6 +424,7 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -431,6 +443,16 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
+	} else {
+		/*
+		 * Any tail page need their mapcount reference taken
+		 * before we return.
+		 */
+		while (refs--) {
+			if (PageTail(tail))
+				get_huge_page_tail(tail);
+			tail++;
+		}
 	}
 
 	return 1;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 4/4] powerpc: gup_hugepte() support THP based tail refcounting
@ 2011-10-16 20:40                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail page
refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 78b14ab..a618ef0 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -385,12 +385,23 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		       unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page;
+	struct page *head, *page, *tail;
 	pte_t pte;
 	int refs;
 
@@ -413,6 +424,7 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -431,6 +443,16 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
+	} else {
+		/*
+		 * Any tail page need their mapcount reference taken
+		 * before we return.
+		 */
+		while (refs--) {
+			if (PageTail(tail))
+				get_huge_page_tail(tail);
+			tail++;
+		}
 	}
 
 	return 1;
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* thp: gup_fast s390/sparc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6]
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-17 14:41                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hello everyone,

These last three patches are incremental with the ones sent yesterday
(that in turn are incremental with the hugepage mapcount tail
refcounting race fix in -mm).

These should complete the gup_fast arch updates to support the
tail page mapcount refcounting.

sh is the only other one supporting gup_fast and hugetlbfs, but it
already looked ok so it requires no changes (it uses get_page). The
archs requiring updates can easily be found by searching:

	find arch/ -name hugetlbpage.c -or -name gup.c

I'm still uncertain why all these page_cache_get/add_speculative calls
in the various gup.c and the pte change checks are needed there, but I
didn't alter those, so if it worked before it'll still work the same.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* thp: gup_fast s390/sparc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6]
@ 2011-10-17 14:41                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hello everyone,

These last three patches are incremental with the ones sent yesterday
(that in turn are incremental with the hugepage mapcount tail
refcounting race fix in -mm).

These should complete the gup_fast arch updates to support the
tail page mapcount refcounting.

sh is the only other one supporting gup_fast and hugetlbfs, but it
already looked ok so it requires no changes (it uses get_page). The
archs requiring updates can easily be found by searching:

	find arch/ -name hugetlbpage.c -or -name gup.c

I'm still uncertain why all these page_cache_get/add_speculative calls
in the various gup.c and the pte change checks are needed there, but I
didn't alter those, so if it worked before it'll still work the same.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH 1/3] s390: gup_huge_pmd() support THP tail refcounting
  2011-10-17 14:41                                 ` Andrea Arcangeli
@ 2011-10-17 14:41                                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail
page refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/s390/mm/gup.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 45b405c..668dda9 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -48,11 +48,22 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	return 1;
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask, result;
-	struct page *head, *page;
+	struct page *head, *page, *tail;
 	int refs;
 
 	result = write ? 0 : _SEGMENT_ENTRY_RO;
@@ -64,6 +75,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -81,6 +93,16 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
+	} else {
+		/*
+		 * Any tail page need their mapcount reference taken
+		 * before we return.
+		 */
+		while (refs--) {
+			if (PageTail(tail))
+				get_huge_page_tail(tail);
+			tail++;
+		}
 	}
 
 	return 1;

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 1/3] s390: gup_huge_pmd() support THP tail refcounting
@ 2011-10-17 14:41                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail
page refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/s390/mm/gup.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 45b405c..668dda9 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -48,11 +48,22 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	return 1;
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask, result;
-	struct page *head, *page;
+	struct page *head, *page, *tail;
 	int refs;
 
 	result = write ? 0 : _SEGMENT_ENTRY_RO;
@@ -64,6 +75,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -81,6 +93,16 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
+	} else {
+		/*
+		 * Any tail page need their mapcount reference taken
+		 * before we return.
+		 */
+		while (refs--) {
+			if (PageTail(tail))
+				get_huge_page_tail(tail);
+			tail++;
+		}
 	}
 
 	return 1;

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 2/3] sparc: gup_pte_range() support THP based tail refcounting
  2011-10-17 14:41                                 ` Andrea Arcangeli
@ 2011-10-17 14:41                                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail
page refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/sparc/mm/gup.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index a986b5d..afcebac 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -12,6 +12,17 @@
 #include <linux/rwsem.h>
 #include <asm/pgtable.h>
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -56,6 +67,8 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(head);
 			return 0;
 		}
+		if (head != page)
+			get_huge_page_tail(page);
 
 		pages[*nr] = page;
 		(*nr)++;

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 2/3] sparc: gup_pte_range() support THP based tail refcounting
@ 2011-10-17 14:41                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Up to this point the code assumed old refcounting for hugepages
(pre-thp). This updates the code directly to the thp mapcount tail
page refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/sparc/mm/gup.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index a986b5d..afcebac 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -12,6 +12,17 @@
 #include <linux/rwsem.h>
 #include <asm/pgtable.h>
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -56,6 +67,8 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(head);
 			return 0;
 		}
+		if (head != page)
+			get_huge_page_tail(page);
 
 		pages[*nr] = page;
 		(*nr)++;

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 3/3] thp: share get_huge_page_tail()
  2011-10-17 14:41                                 ` Andrea Arcangeli
@ 2011-10-17 14:41                                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

This avoids duplicating the function in every arch gup_fast.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/mm/hugetlbpage.c |   11 -----------
 arch/s390/mm/gup.c            |   11 -----------
 arch/sparc/mm/gup.c           |   11 -----------
 arch/x86/mm/gup.c             |   11 -----------
 include/linux/mm.h            |   11 +++++++++++
 5 files changed, 11 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a618ef0..b400535 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -385,17 +385,6 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		       unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 668dda9..755f226 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -48,17 +48,6 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	return 1;
 }
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index afcebac..42c55df 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -12,17 +12,6 @@
 #include <linux/rwsem.h>
 #include <asm/pgtable.h>
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 3b5032a..ea30585 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -108,17 +108,6 @@ static inline void get_head_page_multiple(struct page *page, int nr)
 	SetPageReferenced(page);
 }
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1ce6c0..fedc5f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -376,6 +376,17 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 extern bool __get_page_tail(struct page *page);
 
 static inline void get_page(struct page *page)

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH 3/3] thp: share get_huge_page_tail()
@ 2011-10-17 14:41                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

This avoids duplicating the function in every arch gup_fast.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/mm/hugetlbpage.c |   11 -----------
 arch/s390/mm/gup.c            |   11 -----------
 arch/sparc/mm/gup.c           |   11 -----------
 arch/x86/mm/gup.c             |   11 -----------
 include/linux/mm.h            |   11 +++++++++++
 5 files changed, 11 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a618ef0..b400535 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -385,17 +385,6 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		       unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 668dda9..755f226 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -48,17 +48,6 @@ static inline int gup_pte_range(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	return 1;
 }
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index afcebac..42c55df 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -12,17 +12,6 @@
 #include <linux/rwsem.h>
 #include <asm/pgtable.h>
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 3b5032a..ea30585 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -108,17 +108,6 @@ static inline void get_head_page_multiple(struct page *page, int nr)
 	SetPageReferenced(page);
 }
 
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run
-	 * from under us.
-	 */
-	VM_BUG_ON(page_mapcount(page) < 0);
-	VM_BUG_ON(atomic_read(&page->_count) != 0);
-	atomic_inc(&page->_mapcount);
-}
-
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1ce6c0..fedc5f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -376,6 +376,17 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
+static inline void get_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(page_mapcount(page) < 0);
+	VM_BUG_ON(atomic_read(&page->_count) != 0);
+	atomic_inc(&page->_mapcount);
+}
+
 extern bool __get_page_tail(struct page *page);
 
 static inline void get_page(struct page *page)

^ permalink raw reply related	[flat|nested] 109+ messages in thread

* fix two more s390/sparc gup_fast bugs
  2011-09-23 15:57                               ` Peter Zijlstra
@ 2011-10-17 21:32                                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi,

I just noticed that a return 0 was missing after rolling back *nr.
That's not ok, as gup_fast then wouldn't abort and would, I think, put
pages at the wrong offset in the array... This isn't related to the
recent changes; it was the same in 2.6.37. I don't think it's ok to
return 1 after rolling back *nr.

These next two are incremental with the previous.
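
For context, a sketch of why the missing return 0 matters (simplified
from the generic get_user_pages_fast() fallback logic, not the literal
s390/sparc code):

        int nr = __get_user_pages_fast(start, nr_pages, write, pages);
        if (nr < nr_pages) {
                /*
                 * The slow path resumes at start + (nr << PAGE_SHIFT) and
                 * at pages + nr: it assumes pages[0..nr-1] map 1:1 onto
                 * the first nr pages of the range.  If a leaf helper
                 * rolls *nr back but still returns 1, later pages land at
                 * the rolled-back index and that 1:1 correspondence is
                 * broken.
                 */
        }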

^ permalink raw reply	[flat|nested] 109+ messages in thread

* fix two more s390/sparc gup_fast bugs
@ 2011-10-17 21:32                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

Hi,

I just noticed that a return 0 was missing after rolling back *nr.
That's not ok, as gup_fast then wouldn't abort and would, I think, put
pages at the wrong offset in the array... This isn't related to the
recent changes; it was the same in 2.6.37. I don't think it's ok to
return 1 after rolling back *nr.

These next two are incremental with the previous.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* [PATCH 1/2] s390: gup_huge_pmd() return 0 if pte changes
  2011-10-17 21:32                                 ` Andrea Arcangeli
@ 2011-10-17 21:32                                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

s390 didn't return 0 in that case. If it rolls back *nr, it should also
return zero to avoid adding pages to the array at the wrong offset.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/s390/mm/gup.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 668dda9..da33a02 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -93,16 +93,17 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
-	} else {
-		/*
-		 * Any tail page need their mapcount reference taken
-		 * before we return.
-		 */
-		while (refs--) {
-			if (PageTail(tail))
-				get_huge_page_tail(tail);
-			tail++;
-		}
+		return 0;
+	}
+
+	/*
+	 * Any tail page need their mapcount reference taken before we
+	 * return.
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
 	}
 
 	return 1;

* [PATCH 2/2] powerpc: gup_huge_pmd() return 0 if pte changes
  2011-10-17 21:32                                 ` Andrea Arcangeli
@ 2011-10-17 21:32                                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 109+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Minchan Kim, Michel Lespinasse, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Shaohua Li, Paul E. McKenney,
	Benjamin Herrenschmidt

powerpc didn't return 0 in that case. If it rolls back *nr, it should
also return zero to avoid adding pages to the array at the wrong offset.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/mm/hugetlbpage.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a618ef0..1c59d94 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -443,16 +443,17 @@ static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long add
 		*nr -= refs;
 		while (refs--)
 			put_page(head);
-	} else {
-		/*
-		 * Any tail page need their mapcount reference taken
-		 * before we return.
-		 */
-		while (refs--) {
-			if (PageTail(tail))
-				get_huge_page_tail(tail);
-			tail++;
-		}
+		return 0;
+	}
+
+	/*
+	 * Any tail page need their mapcount reference taken before we
+	 * return.
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
 	}
 
 	return 1;

* Re: [PATCH 2/3] sparc: gup_pte_range() support THP based tail recounting
  2011-10-17 14:41                                   ` Andrea Arcangeli
@ 2011-10-17 22:44                                     ` David Miller
  -1 siblings, 0 replies; 109+ messages in thread
From: David Miller @ 2011-10-17 22:44 UTC (permalink / raw)
  To: aarcange
  Cc: peterz, akpm, minchan.kim, walken, linux-mm, linux-kernel, hughd,
	jweiner, riel, mgorman, kosaki.motohiro, shaohua.li, paulmck,
	benh

From: Andrea Arcangeli <aarcange@redhat.com>
Date: Mon, 17 Oct 2011 16:41:56 +0200

> Up to this point the code assumed old refcounting for hugepages
> (pre-thp). This updates the code directly to the thp mapcount tail
> page refcounting.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: David S. Miller <davem@davemloft.net>

Thread overview: 109+ messages
2011-08-19  7:48 [PATCH 0/9] Use RCU to stabilize page counts Michel Lespinasse
2011-08-19  7:48 ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 1/9] mm: rcu read lock for getting reference on pages in migration_entry_wait() Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 2/9] mm: avoid calling get_page_unless_zero() when charging cgroups Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 3/9] mm: rcu read lock when getting from tail to head page Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 4/9] mm: use get_page in deactivate_page() Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 5/9] kvm: use get_page instead of get_page_unless_zero Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 6/9] mm: assert that get_page_unless_zero() callers hold the rcu lock Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19 23:28   ` Andi Kleen
2011-08-19 23:28     ` Andi Kleen
2011-08-19  7:48 ` [PATCH 7/9] rcu: rcu_get_gp_cookie() / rcu_gp_cookie_elapsed() stand-ins Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 8/9] mm: add API for setting a grace period cookie on compound pages Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:48 ` [PATCH 9/9] mm: make sure tail page counts are stable before splitting THP pages Michel Lespinasse
2011-08-19  7:48   ` Michel Lespinasse
2011-08-19  7:53 ` [PATCH 0/9] Use RCU to stabilize page counts Michel Lespinasse
2011-08-19  7:53   ` Michel Lespinasse
2011-08-22 21:33 ` [PATCH] thp: tail page refcounting fix Andrea Arcangeli
2011-08-22 21:33   ` Andrea Arcangeli
2011-08-23 14:55   ` Andrea Arcangeli
2011-08-23 14:55     ` Andrea Arcangeli
2011-08-23 16:45   ` Minchan Kim
2011-08-23 16:45     ` Minchan Kim
2011-08-23 16:54     ` Andrea Arcangeli
2011-08-23 16:54       ` Andrea Arcangeli
2011-08-23 19:52   ` Michel Lespinasse
2011-08-23 19:52     ` Michel Lespinasse
2011-08-24  0:09     ` Andrea Arcangeli
2011-08-24  0:09       ` Andrea Arcangeli
2011-08-24  0:27       ` Andrea Arcangeli
2011-08-24  0:27         ` Andrea Arcangeli
2011-08-24 13:34         ` [PATCH] thp: tail page refcounting fix #2 Andrea Arcangeli
2011-08-24 13:34           ` Andrea Arcangeli
2011-08-26  6:24           ` Michel Lespinasse
2011-08-26  6:24             ` Michel Lespinasse
2011-08-26 16:10             ` Andrea Arcangeli
2011-08-26 16:10               ` Andrea Arcangeli
2011-08-26 18:54               ` [PATCH] thp: tail page refcounting fix #3 Andrea Arcangeli
2011-08-26 18:54                 ` Andrea Arcangeli
2011-08-27  9:41                 ` Michel Lespinasse
2011-08-27  9:41                   ` Michel Lespinasse
2011-08-27 17:34                   ` [PATCH] thp: tail page refcounting fix #4 Andrea Arcangeli
2011-08-27 17:34                     ` Andrea Arcangeli
2011-08-29  4:20                     ` Minchan Kim
2011-08-29  4:20                       ` Minchan Kim
2011-09-01 15:24                       ` [PATCH] thp: tail page refcounting fix #5 Andrea Arcangeli
2011-09-01 15:24                         ` Andrea Arcangeli
2011-09-01 22:27                         ` Michel Lespinasse
2011-09-01 22:27                           ` Michel Lespinasse
2011-09-01 23:28                         ` Andrew Morton
2011-09-01 23:28                           ` Andrew Morton
2011-09-01 23:45                           ` Andi Kleen
2011-09-01 23:45                             ` Andi Kleen
2011-09-02  0:20                             ` Andrea Arcangeli
2011-09-02  0:20                               ` Andrea Arcangeli
2011-09-02  1:17                               ` Andi Kleen
2011-09-02  1:17                                 ` Andi Kleen
2011-09-02  0:03                         ` Andrew Morton
2011-09-02  0:03                           ` Andrew Morton
2011-09-08 16:51                           ` [PATCH] thp: tail page refcounting fix #6 Andrea Arcangeli
2011-09-08 16:51                             ` Andrea Arcangeli
2011-09-23 15:57                             ` Peter Zijlstra
2011-09-23 15:57                               ` Peter Zijlstra
2011-09-30 13:58                               ` Andrea Arcangeli
2011-09-30 13:58                                 ` Andrea Arcangeli
2011-10-16 20:37                               ` thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times Andrea Arcangeli
2011-10-16 20:37                               ` [PATCH 4/4] powerpc: gup_hugepte() support THP based tail recounting Andrea Arcangeli
2011-10-16 20:40                               ` thp: gup_fast ppc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 1/4] powerpc: remove superfluous PageTail checks on the pte gup_fast Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 2/4] powerpc: get_hugepte() don't put_page() the wrong page Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 3/4] powerpc: gup_hugepte() avoid to free the head page too many times Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-16 20:40                               ` [PATCH 4/4] powerpc: gup_hugepte() support THP based tail recounting Andrea Arcangeli
2011-10-16 20:40                                 ` Andrea Arcangeli
2011-10-17 14:41                               ` thp: gup_fast s390/sparc tail refcounting [was Re: [PATCH] thp: tail page refcounting fix #6] Andrea Arcangeli
2011-10-17 14:41                                 ` Andrea Arcangeli
2011-10-17 14:41                                 ` [PATCH 1/3] s390: gup_huge_pmd() support THP tail recounting Andrea Arcangeli
2011-10-17 14:41                                   ` Andrea Arcangeli
2011-10-17 14:41                                 ` [PATCH 2/3] sparc: gup_pte_range() support THP based " Andrea Arcangeli
2011-10-17 14:41                                   ` Andrea Arcangeli
2011-10-17 22:44                                   ` David Miller
2011-10-17 22:44                                     ` David Miller
2011-10-17 14:41                                 ` [PATCH 3/3] thp: share get_huge_page_tail() Andrea Arcangeli
2011-10-17 14:41                                   ` Andrea Arcangeli
2011-10-17 21:32                               ` fix two more s390/sparc gup_fast bugs Andrea Arcangeli
2011-10-17 21:32                                 ` Andrea Arcangeli
2011-10-17 21:32                                 ` [PATCH 1/2] s390: gup_huge_pmd() return 0 if pte changes Andrea Arcangeli
2011-10-17 21:32                                   ` Andrea Arcangeli
2011-10-17 21:32                                 ` [PATCH 2/2] powerpc: " Andrea Arcangeli
2011-10-17 21:32                                   ` Andrea Arcangeli
2011-08-29 22:40                     ` [PATCH] thp: tail page refcounting fix #4 Michel Lespinasse
2011-08-29 22:40                       ` Michel Lespinasse
2011-08-29 23:30                       ` Andrea Arcangeli
2011-08-29 23:30                         ` Andrea Arcangeli
2011-08-26 19:28             ` [PATCH] thp: tail page refcounting fix #2 Andrea Arcangeli
2011-08-26 19:28               ` Andrea Arcangeli
