All of lore.kernel.org
* [RFC PATCH 0/3] page count lock for simpler put_page
@ 2011-08-04 21:07 Michel Lespinasse
  2011-08-04 21:07 ` [RFC PATCH 1/3] mm: Replace naked page->_count accesses with accessor functions Michel Lespinasse
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-04 21:07 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, Andrew Morton, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, KOSAKI Motohiro, Shaohua Li

I'd like to solicit comments on the following patches against v3.0:

Patch 1 introduces new accessors for page->_count. I believe it is
actually ready for upstream submission.

Patch 2 introduces a new page count lock. Using it, the interaction between
__split_huge_page_refcount and put_compound_page is much simplified.
I would like to get comments about this part. One known issue is that
my implementation of the page count lock does not currently provide
the required memory barrier semantics on non-x86 architectures.
This could however be remedied if we decide to go ahead with the idea.

Patch 3 demonstrates my motivation for this patch series: in my pre-THP
implementation of idle page tracking, I was able to use get_page_unless_zero
in a way that __split_huge_page_refcount made unsafe. Building on top of
patch 2, I can make the required operation safe again. If patch 2 were to
be rejected, I would like to get suggestions about alternative approaches
to implement the get_first_page_unless_zero() operation described here.

Michel Lespinasse (3):
  mm: Replace naked page->_count accesses with accessor functions
  mm: page count lock
  mm: get_first_page_unless_zero()

 arch/powerpc/mm/gup.c                        |    4 +-
 arch/powerpc/platforms/512x/mpc512x_shared.c |    5 +-
 arch/x86/mm/gup.c                            |    6 +-
 drivers/net/niu.c                            |    4 +-
 include/linux/mm.h                           |   38 ++++++--
 include/linux/pagemap.h                      |   10 ++-
 mm/huge_memory.c                             |   25 +++---
 mm/internal.h                                |   97 ++++++++++++++++++++-
 mm/memory_hotplug.c                          |    4 +-
 mm/swap.c                                    |  122 +++++++++++---------------
 10 files changed, 205 insertions(+), 110 deletions(-)

-- 
1.7.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [RFC PATCH 1/3] mm: Replace naked page->_count accesses with accessor functions
  2011-08-04 21:07 [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
@ 2011-08-04 21:07 ` Michel Lespinasse
  2011-08-04 21:07 ` [RFC PATCH 2/3] mm: page count lock Michel Lespinasse
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-04 21:07 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, Andrew Morton, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, KOSAKI Motohiro, Shaohua Li

This change replaces all naked page->_count accesses with accessor functions,
with a few exceptions:
- assertions checking for equality with zero
- debug messages displaying the page count value
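
The shape of these accessors can be sketched in userspace with C11 atomics
(a stand-in only: struct page is reduced to its _count field and the names
merely mirror the patch; this is not the kernel's atomic_t API):

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal stand-in for struct page: only the reference count. */
struct page { atomic_int _count; };

/* Patch 1's __page_count(): read the raw count. */
static int __page_count(struct page *page)
{
	return atomic_load(&page->_count);
}

/* Patch 1's __get_page(): take one reference. */
static void __get_page(struct page *page)
{
	atomic_fetch_add(&page->_count, 1);
}

/* Patch 1's __add_page_count(): take nr references at once. */
static void __add_page_count(int nr, struct page *page)
{
	atomic_fetch_add(&page->_count, nr);
}

/* Drop a reference; return non-zero if the count fell to zero. */
static int put_page_testzero(struct page *page)
{
	assert(__page_count(page) > 0);
	return atomic_fetch_sub(&page->_count, 1) == 1;
}
```

The point of the indirection: once all users go through these helpers, a
later change to the representation of page->_count (as in patch 2) only
has to touch the helpers.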

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/powerpc/mm/gup.c                        |    4 +-
 arch/powerpc/platforms/512x/mpc512x_shared.c |    5 ++-
 arch/x86/mm/gup.c                            |    6 ++--
 drivers/net/niu.c                            |    4 +-
 include/linux/mm.h                           |   27 ++++++++++++++++++++-----
 include/linux/pagemap.h                      |    4 +-
 mm/huge_memory.c                             |    9 ++++---
 mm/internal.h                                |   10 +++++++++
 mm/memory_hotplug.c                          |    4 +-
 mm/swap.c                                    |    6 ++--
 10 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index fec1320..7b7f846 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -22,8 +22,8 @@ static inline void get_huge_page_tail(struct page *page)
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(__page_count(page) < 0);
+	__get_page(page);
 }
 
 /*
diff --git a/arch/powerpc/platforms/512x/mpc512x_shared.c b/arch/powerpc/platforms/512x/mpc512x_shared.c
index e41ebbd..7c5a58a 100644
--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -18,6 +18,7 @@
 #include <linux/of_platform.h>
 #include <linux/fsl-diu-fb.h>
 #include <linux/bootmem.h>
+#include <linux/mm.h>
 #include <sysdev/fsl_soc.h>
 
 #include <asm/cacheflush.h>
@@ -200,8 +201,8 @@ static inline void mpc512x_free_bootmem(struct page *page)
 {
 	__ClearPageReserved(page);
 	BUG_ON(PageTail(page));
-	BUG_ON(atomic_read(&page->_count) > 1);
-	atomic_set(&page->_count, 1);
+	BUG_ON(page_count(page) > 1);
+	init_page_count(page);
 	__free_page(page);
 	totalram_pages++;
 }
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dbe34b9..30ea122 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -104,7 +104,7 @@ static inline void get_head_page_multiple(struct page *page, int nr)
 {
 	VM_BUG_ON(page != compound_head(page));
 	VM_BUG_ON(page_count(page) == 0);
-	atomic_add(nr, &page->_count);
+	__add_page_count(nr, page);
 	SetPageReferenced(page);
 }
 
@@ -114,8 +114,8 @@ static inline void get_huge_page_tail(struct page *page)
 	 * __split_huge_page_refcount() cannot run
 	 * from under us.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(__page_count(page) < 0);
+	__get_page(page);
 }
 
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
index cc25bff..9057ab1 100644
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -3354,8 +3354,8 @@ static int niu_rbr_add_page(struct niu *np, struct rx_ring_info *rp,
 
 	niu_hash_page(rp, page, addr);
 	if (rp->rbr_blocks_per_page > 1)
-		atomic_add(rp->rbr_blocks_per_page - 1,
-			   &compound_head(page)->_count);
+		__add_page_count(rp->rbr_blocks_per_page - 1,
+				 compound_head(page));
 
 	for (i = 0; i < rp->rbr_blocks_per_page; i++) {
 		__le32 *rbr = &rp->rbr[start_index + i];
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..7984f90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -266,12 +266,17 @@ struct inode;
  * routine so they can be sure the page doesn't go away from under them.
  */
 
+static inline int __page_count(struct page *page)
+{
+	return atomic_read(&page->_count);
+}
+
 /*
  * Drop a ref, return true if the refcount fell to zero (the page has no users)
  */
 static inline int put_page_testzero(struct page *page)
 {
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(__page_count(page) <= 0);
 	return atomic_dec_and_test(&page->_count);
 }
 
@@ -357,7 +362,17 @@ static inline struct page *compound_head(struct page *page)
 
 static inline int page_count(struct page *page)
 {
-	return atomic_read(&compound_head(page)->_count);
+	return __page_count(compound_head(page));
+}
+
+static inline void __add_page_count(int nr, struct page *page)
+{
+	atomic_add(nr, &page->_count);
+}
+
+static inline void __get_page(struct page *page)
+{
+	atomic_inc(&page->_count);
 }
 
 static inline void get_page(struct page *page)
@@ -370,8 +385,8 @@ static inline void get_page(struct page *page)
 	 * bugcheck only verifies that the page->_count isn't
 	 * negative.
 	 */
-	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
-	atomic_inc(&page->_count);
+	VM_BUG_ON(__page_count(page) < !PageTail(page));
+	__get_page(page);
 	/*
 	 * Getting a tail page will elevate both the head and tail
 	 * page->_count(s).
@@ -382,8 +397,8 @@ static inline void get_page(struct page *page)
 		 * __split_huge_page_refcount can't run under
 		 * get_page().
 		 */
-		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
-		atomic_inc(&page->first_page->_count);
+		VM_BUG_ON(__page_count(page->first_page) <= 0);
+		__get_page(page->first_page);
 	}
 }
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716875e..3dc3334 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -147,7 +147,7 @@ static inline int page_cache_get_speculative(struct page *page)
 	 * SMP requires.
 	 */
 	VM_BUG_ON(page_count(page) == 0);
-	atomic_inc(&page->_count);
+	__get_page(page);
 
 #else
 	if (unlikely(!get_page_unless_zero(page))) {
@@ -176,7 +176,7 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 	VM_BUG_ON(!in_atomic());
 # endif
 	VM_BUG_ON(page_count(page) == 0);
-	atomic_add(count, &page->_count);
+	__add_page_count(count, page);
 
 #else
 	if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81532f2..2d45af2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1156,6 +1156,7 @@ static void __split_huge_page_refcount(struct page *page)
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
 	int zonestat;
+	int tail_counts = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1165,10 +1166,9 @@ static void __split_huge_page_refcount(struct page *page)
 		struct page *page_tail = page + i;
 
 		/* tail_page->_count cannot change */
-		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
-		BUG_ON(page_count(page) <= 0);
-		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
-		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+		tail_counts += __page_count(page_tail);
+		__add_page_count(page_mapcount(page) + 1, page_tail);
+		BUG_ON(__page_count(page_tail) <= 0);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1253,6 +1253,7 @@ static void __split_huge_page_refcount(struct page *page)
 		put_page(page_tail);
 	}
 
+	__sub_page_count(tail_counts, page);
 	/*
 	 * Only the head page (now become a regular page) is required
 	 * to be pinned by the caller.
diff --git a/mm/internal.h b/mm/internal.h
index d071d38..93d8da4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -32,11 +32,21 @@ static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
+static inline void __sub_page_count(int nr, struct page *page)
+{
+	atomic_sub(nr, &page->_count);
+}
+
 static inline void __put_page(struct page *page)
 {
 	atomic_dec(&page->_count);
 }
 
+static inline int __put_page_return(struct page *page)
+{
+	return atomic_dec_return(&page->_count) >> 1;
+}
+
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c46887b..06d0575 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -88,7 +88,7 @@ static void get_page_bootmem(unsigned long info,  struct page *page,
 	page->lru.next = (struct list_head *) type;
 	SetPagePrivate(page);
 	set_page_private(page, info);
-	atomic_inc(&page->_count);
+	__get_page(page);
 }
 
 /* reference to __meminit __free_pages_bootmem is valid
@@ -101,7 +101,7 @@ void __ref put_page_bootmem(struct page *page)
 	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
 	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
 
-	if (atomic_dec_return(&page->_count) == 1) {
+	if (__put_page_return(page) == 1) {
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
 		INIT_LIST_HEAD(&page->lru);
diff --git a/mm/swap.c b/mm/swap.c
index 3a442f1..46ae089 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -128,9 +128,9 @@ static void put_compound_page(struct page *page)
 			if (put_page_testzero(page_head))
 				VM_BUG_ON(1);
 			/* __split_huge_page_refcount will wait now */
-			VM_BUG_ON(atomic_read(&page->_count) <= 0);
-			atomic_dec(&page->_count);
-			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			VM_BUG_ON(__page_count(page) <= 0);
+			__put_page(page);
+			VM_BUG_ON(__page_count(page_head) <= 0);
 			compound_unlock_irqrestore(page_head, flags);
 			if (put_page_testzero(page_head)) {
 				if (PageHead(page_head))
-- 
1.7.3.1


* [RFC PATCH 2/3] mm: page count lock
  2011-08-04 21:07 [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
  2011-08-04 21:07 ` [RFC PATCH 1/3] mm: Replace naked page->_count accesses with accessor functions Michel Lespinasse
@ 2011-08-04 21:07 ` Michel Lespinasse
  2011-08-07 14:00   ` Minchan Kim
  2011-08-04 21:07 ` [RFC PATCH 3/3] mm: get_first_page_unless_zero() Michel Lespinasse
  2011-08-05  6:39 ` [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
  3 siblings, 1 reply; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-04 21:07 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, Andrew Morton, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, KOSAKI Motohiro, Shaohua Li

This change introduces a new lock in order to simplify the way
__split_huge_page_refcount and put_compound_page interact.

The synchronization problem in this code is that when operating on
tail pages, put_page() needs to adjust page counts for both the tail
and head pages. On the other hand, when splitting compound pages
__split_huge_page_refcount() needs to adjust the head page count so that
it does not reflect tail page references anymore. When the two race
together, they must agree as to the order things happen so that the head
page reference count does not end up with an improper value.

I propose doing this using a new lock on the tail page. Compared to
the previous version using the compound lock on the head page,
the compound page case of put_page() ends up being much simpler.

The new lock is implemented using the lowest bit of page->_count.
Page count accessor functions are modified to handle this transparently.
New accessors are added in mm/internal.h to lock/unlock the
page count lock while simultaneously accessing the page count value.
The number of atomic operations required is thus minimized.

Note that the current implementation takes advantage of the implicit
memory barrier provided by x86 on atomic RMW instructions to provide
the expected lock/unlock semantics. Clearly this is not portable
across architectures, and will have to be accommodated using
an explicit memory barrier on architectures that require it.
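
The packed representation can be illustrated with a userspace C11 sketch
(assumptions: struct page reduced to its count word, no preempt_disable(),
a plain busy spin in place of cpu_relax(); this demonstrates only the
arithmetic of the encoding, not the kernel's barrier requirements):

```c
#include <assert.h>
#include <stdatomic.h>

#define _PAGE_COUNT_LOCK  (1 << 0)	/* bit 0: the page count lock    */
#define _PAGE_COUNT_SHIFT 1
#define _PAGE_COUNT_ONE   (1 << _PAGE_COUNT_SHIFT)	/* one reference */

struct page { atomic_int _count; };

/* Logical reference count: strip the lock bit by shifting. */
static int __page_count(struct page *page)
{
	return atomic_load(&page->_count) >> _PAGE_COUNT_SHIFT;
}

/* Wait for the lock bit to clear, then add nr references and set the
 * lock bit in one cmpxchg.  Returns the count observed before the
 * update, as lock_add_page_count() does in the patch. */
static int lock_add_page_count(int nr, struct page *page)
{
	int prev, next;
	for (;;) {
		prev = atomic_load(&page->_count);
		if (prev & _PAGE_COUNT_LOCK)
			continue;	/* cpu_relax() in the kernel */
		next = prev + (nr << _PAGE_COUNT_SHIFT) + _PAGE_COUNT_LOCK;
		if (atomic_compare_exchange_strong(&page->_count, &prev, next))
			return prev >> _PAGE_COUNT_SHIFT;
	}
}

/* Release: the lock is bit 0, so a plain decrement clears it. */
static void unlock_page_count(struct page *page)
{
	assert(atomic_load(&page->_count) & _PAGE_COUNT_LOCK);
	atomic_fetch_sub(&page->_count, _PAGE_COUNT_LOCK);
}
```

Because the lock bit travels in the same word as the count, taking the
lock and adjusting the count costs a single atomic RMW, which is what
keeps the put_page() fast path cheap.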

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h      |   17 +++++---
 include/linux/pagemap.h |    6 ++-
 mm/huge_memory.c        |   20 ++++-----
 mm/internal.h           |   68 ++++++++++++++++++++++++++++++--
 mm/swap.c               |   98 ++++++++++++-----------------------------------
 5 files changed, 113 insertions(+), 96 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7984f90..fa64aa7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -266,9 +266,14 @@ struct inode;
  * routine so they can be sure the page doesn't go away from under them.
  */
 
+#define _PAGE_COUNT_LOCK (1 << 0)
+
+#define _PAGE_COUNT_SHIFT 1
+#define _PAGE_COUNT_ONE (1 << _PAGE_COUNT_SHIFT)
+
 static inline int __page_count(struct page *page)
 {
-	return atomic_read(&page->_count);
+	return atomic_read(&page->_count) >> _PAGE_COUNT_SHIFT;
 }
 
 /*
@@ -277,7 +282,7 @@ static inline int __page_count(struct page *page)
 static inline int put_page_testzero(struct page *page)
 {
 	VM_BUG_ON(__page_count(page) <= 0);
-	return atomic_dec_and_test(&page->_count);
+	return atomic_sub_and_test(_PAGE_COUNT_ONE, &page->_count);
 }
 
 /*
@@ -286,7 +291,7 @@ static inline int put_page_testzero(struct page *page)
  */
 static inline int get_page_unless_zero(struct page *page)
 {
-	return atomic_inc_not_zero(&page->_count);
+	return atomic_add_unless(&page->_count, _PAGE_COUNT_ONE, 0);
 }
 
 extern int page_is_ram(unsigned long pfn);
@@ -367,12 +372,12 @@ static inline int page_count(struct page *page)
 
 static inline void __add_page_count(int nr, struct page *page)
 {
-	atomic_add(nr, &page->_count);
+	atomic_add(nr << _PAGE_COUNT_SHIFT, &page->_count);
 }
 
 static inline void __get_page(struct page *page)
 {
-	atomic_inc(&page->_count);
+	__add_page_count(1, page);
 }
 
 static inline void get_page(struct page *page)
@@ -414,7 +419,7 @@ static inline struct page *virt_to_head_page(const void *x)
  */
 static inline void init_page_count(struct page *page)
 {
-	atomic_set(&page->_count, 1);
+	atomic_set(&page->_count, _PAGE_COUNT_ONE);
 }
 
 /*
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3dc3334..e9ec235 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -179,7 +179,8 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 	__add_page_count(count, page);
 
 #else
-	if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
+	if (unlikely(!atomic_add_unless(&page->_count,
+					count << _PAGE_COUNT_SHIFT, 0)))
 		return 0;
 #endif
 	VM_BUG_ON(PageCompound(page) && page != compound_head(page));
@@ -189,6 +190,7 @@ static inline int page_cache_add_speculative(struct page *page, int count)
 
 static inline int page_freeze_refs(struct page *page, int count)
 {
+	count <<= _PAGE_COUNT_SHIFT;
 	return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
 }
 
@@ -197,7 +199,7 @@ static inline void page_unfreeze_refs(struct page *page, int count)
 	VM_BUG_ON(page_count(page) != 0);
 	VM_BUG_ON(count == 0);
 
-	atomic_set(&page->_count, count);
+	atomic_set(&page->_count, count << _PAGE_COUNT_SHIFT);
 }
 
 #ifdef CONFIG_NUMA
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2d45af2..8c0295f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1165,14 +1165,16 @@ static void __split_huge_page_refcount(struct page *page)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
-		/* tail_page->_count cannot change */
-		tail_counts += __page_count(page_tail);
-		__add_page_count(page_mapcount(page) + 1, page_tail);
+		/*
+		 * To prevent a race against put_page(), reading the
+		 * tail page count value (prior to splitting) and
+		 * clearing the PageTail flag must be done together
+		 * under protection of the tail page count lock.
+		 */
+		tail_counts += lock_add_page_count(page_mapcount(page) + 1,
+						   page_tail);
 		BUG_ON(__page_count(page_tail) <= 0);
 
-		/* after clearing PageTail the gup refcount can be released */
-		smp_mb();
-
 		/*
 		 * retain hwpoison flag of the poisoned tail page:
 		 *   fix for the unsuitable process killed on Guest Machine(KVM)
@@ -1186,11 +1188,7 @@ static void __split_huge_page_refcount(struct page *page)
 				      (1L << PG_uptodate)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/*
-		 * 1) clear PageTail before overwriting first_page
-		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
-		 */
-		smp_wmb();
+		unlock_page_count(page_tail);
 
 		/*
 		 * __split_huge_page_splitting() already set the
diff --git a/mm/internal.h b/mm/internal.h
index 93d8da4..8dde36d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -18,7 +18,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 
 static inline void set_page_count(struct page *page, int v)
 {
-	atomic_set(&page->_count, v);
+	atomic_set(&page->_count, v << _PAGE_COUNT_SHIFT);
 }
 
 /*
@@ -34,17 +34,77 @@ static inline void set_page_refcounted(struct page *page)
 
 static inline void __sub_page_count(int nr, struct page *page)
 {
-	atomic_sub(nr, &page->_count);
+	atomic_sub(nr << _PAGE_COUNT_SHIFT, &page->_count);
 }
 
 static inline void __put_page(struct page *page)
 {
-	atomic_dec(&page->_count);
+	__sub_page_count(1, page);
 }
 
 static inline int __put_page_return(struct page *page)
 {
-	return atomic_dec_return(&page->_count) >> 1;
+	return atomic_sub_return(_PAGE_COUNT_ONE,
+				 &page->_count) >> _PAGE_COUNT_SHIFT;
+}
+
+static inline int lock_add_page_count(int nr, struct page *page)
+{
+	int count, prev, next;
+
+retry_spin:
+	count = atomic_read(&page->_count);
+retry:
+	if (count & _PAGE_COUNT_LOCK) {
+		cpu_relax();
+		goto retry_spin;
+	}
+	prev = count;
+	next = count + (nr << _PAGE_COUNT_SHIFT) + _PAGE_COUNT_LOCK;
+	preempt_disable();
+	count = atomic_cmpxchg(&page->_count, prev, next);
+	if (count != prev) {
+		preempt_enable();
+		goto retry;
+	}
+	__acquire(page_count_lock);
+	return count >> _PAGE_COUNT_SHIFT;
+}
+
+static inline void lock_page_count(struct page *page)
+{
+	/* Faster implementation would be possible using atomic test and set,
+	   but linux only provides atomic bit operations on long types... */
+	lock_add_page_count(0, page);
+}
+
+static inline void unlock_page_count(struct page *page)
+{
+	VM_BUG_ON(!(atomic_read(&page->_count) & _PAGE_COUNT_LOCK));
+	BUG_ON(_PAGE_COUNT_LOCK != 1);
+	atomic_dec(&page->_count);
+	preempt_enable();
+	__release(page_count_lock);
+}
+
+static inline void unlock_sub_page_count(int nr, struct page *page)
+{
+	VM_BUG_ON(!(atomic_read(&page->_count) & _PAGE_COUNT_LOCK));
+	atomic_sub((nr << _PAGE_COUNT_SHIFT) + _PAGE_COUNT_LOCK,
+		   &page->_count);
+	preempt_enable();
+	__release(page_count_lock);
+}
+
+static inline int unlock_sub_test_page_count(int nr, struct page *page)
+{
+	int zero;
+	VM_BUG_ON(!(atomic_read(&page->_count) & _PAGE_COUNT_LOCK));
+	zero = atomic_sub_and_test((nr << _PAGE_COUNT_SHIFT) + _PAGE_COUNT_LOCK,
+				   &page->_count);
+	preempt_enable();
+	__release(page_count_lock);
+	return zero;
 }
 
 extern unsigned long highest_memmap_pfn;
diff --git a/mm/swap.c b/mm/swap.c
index 46ae089..1e91a1b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -74,90 +74,42 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-static void put_compound_page(struct page *page)
+void put_page(struct page *page)
 {
 	if (unlikely(PageTail(page))) {
-		/* __split_huge_page_refcount can run under us */
-		struct page *page_head = page->first_page;
-		smp_rmb();
+		struct page *page_tail = page;
+
 		/*
-		 * If PageTail is still set after smp_rmb() we can be sure
-		 * that the page->first_page we read wasn't a dangling pointer.
-		 * See __split_huge_page_refcount() smp_wmb().
+		 * To prevent a race against __split_huge_page_refcount(),
+		 * updating the tail page count and checking the
+		 * PageTail flag must be done together under
+		 * protection of the tail page count lock.
 		 */
-		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
-			unsigned long flags;
-			/*
-			 * Verify that our page_head wasn't converted
-			 * to a a regular page before we got a
-			 * reference on it.
-			 */
-			if (unlikely(!PageHead(page_head))) {
-				/* PageHead is cleared after PageTail */
-				smp_rmb();
-				VM_BUG_ON(PageTail(page));
-				goto out_put_head;
-			}
-			/*
-			 * Only run compound_lock on a valid PageHead,
-			 * after having it pinned with
-			 * get_page_unless_zero() above.
-			 */
-			smp_mb();
-			/* page_head wasn't a dangling pointer */
-			flags = compound_lock_irqsave(page_head);
-			if (unlikely(!PageTail(page))) {
-				/* __split_huge_page_refcount run before us */
-				compound_unlock_irqrestore(page_head, flags);
-				VM_BUG_ON(PageHead(page_head));
-			out_put_head:
-				if (put_page_testzero(page_head))
-					__put_single_page(page_head);
-			out_put_single:
-				if (put_page_testzero(page))
-					__put_single_page(page);
-				return;
-			}
-			VM_BUG_ON(page_head != page->first_page);
-			/*
-			 * We can release the refcount taken by
-			 * get_page_unless_zero now that
-			 * split_huge_page_refcount is blocked on the
-			 * compound_lock.
-			 */
-			if (put_page_testzero(page_head))
-				VM_BUG_ON(1);
-			/* __split_huge_page_refcount will wait now */
+		lock_page_count(page);
+		if (unlikely(!PageTail(page))) {
 			VM_BUG_ON(__page_count(page) <= 0);
-			__put_page(page);
-			VM_BUG_ON(__page_count(page_head) <= 0);
-			compound_unlock_irqrestore(page_head, flags);
-			if (put_page_testzero(page_head)) {
-				if (PageHead(page_head))
-					__put_compound_page(page_head);
-				else
-					__put_single_page(page_head);
-			}
-		} else {
-			/* page_head is a dangling pointer */
-			VM_BUG_ON(PageTail(page));
-			goto out_put_single;
+			if (unlock_sub_test_page_count(1, page))
+				__put_single_page(page);
+			return;
 		}
-	} else if (put_page_testzero(page)) {
+
+		/*
+		 * The head page must be located under protection of the
+		 * tail page count lock, but we can release this lock
+		 * before putting the head page.
+		 */
+		page = page->first_page;
+		VM_BUG_ON(__page_count(page_tail) <= 0);
+		unlock_sub_page_count(1, page_tail);
+	}
+
+	if (put_page_testzero(page)) {
 		if (PageHead(page))
 			__put_compound_page(page);
 		else
 			__put_single_page(page);
 	}
 }
-
-void put_page(struct page *page)
-{
-	if (unlikely(PageCompound(page)))
-		put_compound_page(page);
-	else if (put_page_testzero(page))
-		__put_single_page(page);
-}
 EXPORT_SYMBOL(put_page);
 
 /**
@@ -575,7 +527,7 @@ void release_pages(struct page **pages, int nr, int cold)
 				spin_unlock_irqrestore(&zone->lru_lock, flags);
 				zone = NULL;
 			}
-			put_compound_page(page);
+			put_page(page);
 			continue;
 		}
 
-- 
1.7.3.1


* [RFC PATCH 3/3] mm: get_first_page_unless_zero()
  2011-08-04 21:07 [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
  2011-08-04 21:07 ` [RFC PATCH 1/3] mm: Replace naked page->_count accesses with accessor functions Michel Lespinasse
  2011-08-04 21:07 ` [RFC PATCH 2/3] mm: page count lock Michel Lespinasse
@ 2011-08-04 21:07 ` Michel Lespinasse
  2011-08-07 14:13   ` Minchan Kim
  2011-08-05  6:39 ` [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
  3 siblings, 1 reply; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-04 21:07 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, Andrew Morton, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, KOSAKI Motohiro, Shaohua Li

This change introduces a new get_first_page_unless_zero() function, to be
used for idle page tracking in a future patch series. It also
illustrates why I care about introducing the page count lock discussed
in the previous commit.

To explain the context: for idle page tracking, I am scanning pages
at a known rate based on their physical address. I want to find out
if pages have been referenced since the last scan using page_referenced(),
but before that I must acquire a reference on the page and do basic
checks on the page type. Before THP, it was safe to acquire references
using get_page_unless_zero(), but this won't work in a THP-enabled kernel
due to the possible race with __split_huge_page_refcount(). Thus, the
proposed get_first_page_unless_zero() function:

- must act like get_page_unless_zero() if the page is not a tail page;
- returns 0 for tail pages.

Without the page count lock I'm proposing, other approaches don't work
as well at providing mutual exclusion with __split_huge_page_refcount():

- using the zone LRU lock would work, but has low granularity and
  exhibits contention under some of our workloads
- using the page compound lock on some head page wouldn't work well:
  suppose the page we want to get() is currently a single page, but we
  don't hold a reference on it. It can disappear at any time and get
  replaced with a tail page, so it's unsafe to just get a reference
  count on it. OTOH if it does NOT get replaced with a tail page, there
  is no head page to take a compound lock on.
- tricks involving page table locks, disabling interrupts to prevent
  TLB shootdown, or PMD splitting flag, don't work because we don't know
  of an existing mapping for the page.
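
Under patch 2's packed encoding, the core of the proposed operation is a
single cmpxchg that refuses both a zero count and a held lock bit. A
userspace sketch of that step (same simplifications as before: no preempt
control, struct page reduced to its count word, busy spin instead of
cpu_relax()):

```c
#include <assert.h>
#include <stdatomic.h>

#define _PAGE_COUNT_LOCK  (1 << 0)
#define _PAGE_COUNT_SHIFT 1
#define _PAGE_COUNT_ONE   (1 << _PAGE_COUNT_SHIFT)

struct page { atomic_int _count; };

/* Take one reference and the page count lock in a single step, unless
 * the count is zero.  Returns 1 on success with the lock bit set, or 0
 * if the page had no references left (and so may be being freed). */
static int lock_page_count_get_unless_zero(struct page *page)
{
	int prev, next;
	for (;;) {
		prev = atomic_load(&page->_count);
		if (prev < _PAGE_COUNT_ONE)
			return 0;	/* zero references: hands off */
		if (prev & _PAGE_COUNT_LOCK)
			continue;	/* lock held: spin and re-read */
		next = prev + _PAGE_COUNT_ONE + _PAGE_COUNT_LOCK;
		if (atomic_compare_exchange_strong(&page->_count, &prev, next))
			return 1;
	}
}
```

With the lock held, the caller can then test PageTail() knowing that
__split_huge_page_refcount() cannot concurrently move references between
this page and a head page.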

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/internal.h |   25 +++++++++++++++++++++++++
 mm/swap.c     |   26 ++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8dde36d..7894a33 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -48,6 +48,31 @@ static inline int __put_page_return(struct page *page)
 				 &page->_count) >> _PAGE_COUNT_SHIFT;
 }
 
+static inline int lock_page_count_get_unless_zero(struct page *page)
+{
+	int count, prev, next;
+
+retry_spin:
+	count = atomic_read(&page->_count);
+retry:
+	if (count < _PAGE_COUNT_ONE)
+		return 0;
+	else if (count & _PAGE_COUNT_LOCK) {
+		cpu_relax();
+		goto retry_spin;
+	}
+	prev = count;
+	next = count + _PAGE_COUNT_ONE + _PAGE_COUNT_LOCK;
+	preempt_disable();
+	count = atomic_cmpxchg(&page->_count, prev, next);
+	if (count != prev) {
+		preempt_enable();
+		goto retry;
+	}
+	__acquire(page_count_lock);
+	return 1;
+}
+
 static inline int lock_add_page_count(int nr, struct page *page)
 {
 	int count, prev, next;
diff --git a/mm/swap.c b/mm/swap.c
index 1e91a1b..27cbb14 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -112,6 +112,32 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+int get_first_page_unless_zero(struct page *page)
+{
+	if (unlikely(PageTail(page)))
+		return 0;
+
+	/*
+	 * As we do not have a reference count on the page, it could get
+	 * freed anytime and reallocated as a tail page. And if it does, it
+	 * would be unsafe to just increase that tail page's reference count,
+	 * as __split_huge_page_refcount() could then race against us and drop
+	 * an extra reference on the corresponding head page.
+	 *
+	 * Taking the page count lock here protects us against this scenario.
+	 */
+	if (!lock_page_count_get_unless_zero(page))
+		return 0;
+
+	if (unlikely(PageTail(page))) {
+		unlock_sub_page_count(1, page);
+		return 0;
+	}
+	unlock_page_count(page);
+	return 1;
+}
+EXPORT_SYMBOL(get_first_page_unless_zero);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru
-- 
1.7.3.1


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-04 21:07 [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
                   ` (2 preceding siblings ...)
  2011-08-04 21:07 ` [RFC PATCH 3/3] mm: get_first_page_unless_zero() Michel Lespinasse
@ 2011-08-05  6:39 ` Michel Lespinasse
  2011-08-07 14:25   ` Minchan Kim
  2011-08-12 16:58   ` Andrea Arcangeli
  3 siblings, 2 replies; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-05  6:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, Andrew Morton, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, KOSAKI Motohiro, Shaohua Li

On Thu, Aug 4, 2011 at 2:07 PM, Michel Lespinasse <walken@google.com> wrote:
> Patch 3 demonstrates my motivation for this patch series: in my pre-THP
> implementation of idle page tracking, I was able to use get_page_unless_zero
> in a way that __split_huge_page_refcount made unsafe. Building on top of
> patch 2, I can make the required operation safe again. If patch 2 was to
> be rejected, I would like to get suggestions about alternative approaches
> to implement the get_first_page_unless_zero() operation described here.

I should add that I am quite worried about the places that use
get_page_unless_zero (or the page_cache_*_speculative wrappers) today.
My worrisome scenario would be as follows:

- thread T finds a pointer to a page P (possibly from a radix tree in
find_get_page() )
- page P gets freed by another thread
- page P gets re-allocated as the tail of a THP page by another thread
- another thread gets a reference on page P
- thread T proceeds doing page_cache_get_speculative(P), intending to
then check that P is really the page it wanted
- another thread splits up P's compound page;
__split_huge_page_refcount subtracts T's refcount on P from head(P)'s
refcount
- thread T figures out that it didn't get the page it expected, calls
page_cache_release(P). But it's too late - the refcount for what used
to be head(P) has already been corrupted (incorrectly decremented).

Does anything prevent the above ?

I can see that the page_cache_get_speculative comment in
include/linux/pagemap.h maps out one way to prevent the issue. If
thread T continually held an rcu read lock from the time it finds the
pointer to P until the time it calls get_page_unless_zero on that
page, AND there was a synchronize_rcu() call somewhere between the
time a THP page gets allocated and the time __split_huge_page_refcount
might first get called on that page, then things would be safe.
However, that does not seem to be true today: I could not find a
synchronize_rcu() call before __split_huge_page_refcount(), AND there
are also places (such as deactivate_page() for example) that call
get_page_unless_zero without being within an rcu read locked section
(or holding the zone lru lock to provide exclusion against
__split_huge_page_refcount).

Is there another protection mechanism that I have not considered ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.



* Re: [RFC PATCH 2/3] mm: page count lock
  2011-08-04 21:07 ` [RFC PATCH 2/3] mm: page count lock Michel Lespinasse
@ 2011-08-07 14:00   ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2011-08-07 14:00 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, Andrew Morton,
	Rik van Riel, Mel Gorman, Johannes Weiner, KOSAKI Motohiro,
	Shaohua Li

On Thu, Aug 04, 2011 at 02:07:21PM -0700, Michel Lespinasse wrote:
> This change introduces a new lock in order to simplify the way
> __split_huge_page_refcount and put_compound_page interact.
> 
> The synchronization problem in this code is that when operating on
> tail pages, put_page() needs to adjust page counts for both the tail
> and head pages. On the other hand, when splitting compound pages
> __split_huge_page_refcount() needs to adjust the head page count so that
> it does not reflect tail page references anymore. When the two race
> together, they must agree as to the order things happen so that the head
> page reference count does not end up with an improper value.
> 
> I propose doing this using a new lock on the tail page. Compared to
> the previous version using the compound lock on the head page,
> the compound page case of put_page() ends up being much simpler.
> 
> The new lock is implemented using the lowest bit of page->_count.
> Page count accessor functions are modified to handle this transparently.
> New accessors are added in mm/internal.h to lock/unlock the
> page count lock while simultaneously accessing the page count value.
> The number of atomic operations required is thus minimized.
> 
> Note that the current implementation takes advantage of the implicit
> memory barrier provided by x86 on atomic RMW instructions to provide
> the expected lock/unlock semantics. Clearly this is not portable
> across architectures, and will have to be accommodated using
> an explicit memory barrier on architectures that require it.
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

I didn't take long to look for faults, but I see the approach and it
seems fine, except for the barrier stuff.
I agree this patch simplifies what THP made complicated in put_page.
It would be very good for readability. :)

But the concern is that put_page on a tail page is a rare operation while
get_page is a very frequent one. And you are going to improve readability
at the cost of some performance. A shift operation's cost should be
negligible, but at the very least we need the numbers.

If it doesn't hurt performance, I absolutely support your patch!
Your patch would reduce many atomic operations on the head page in
put_page, as well as improving readability.

-- 
Kind regards,
Minchan Kim



* Re: [RFC PATCH 3/3] mm: get_first_page_unless_zero()
  2011-08-04 21:07 ` [RFC PATCH 3/3] mm: get_first_page_unless_zero() Michel Lespinasse
@ 2011-08-07 14:13   ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2011-08-07 14:13 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, Andrew Morton,
	Rik van Riel, Mel Gorman, Johannes Weiner, KOSAKI Motohiro,
	Shaohua Li

On Thu, Aug 04, 2011 at 02:07:22PM -0700, Michel Lespinasse wrote:
> This change introduces a new get_first_page_unless_zero() function, to be
> used for idle page tracking in a future patch series. It also
> illustrates why I care about introducing the page count lock discussed
> in the previous commit.
> 
> To explain the context: for idle page tracking, I am scanning pages
> at a known rate based on their physical address. I want to find out
> if pages have been referenced since the last scan using page_referenced(),
> but before that I must acquire a reference on the page and do basic
> checks about the page type. Before THP, it was safe to acquire references
> using get_page_unless_zero(), but this won't work in a THP-enabled kernel
> due to the possible race with __split_huge_page_refcount(). Thus, the new
> proposed get_first_page_unless_zero() function:
> 
> - must act like get_page_unless_zero() if the page is not a tail page;
> - returns 0 for tail pages.
> 
> Without the page count lock I'm proposing, other approaches don't work
> as well to provide mutual exclusion with __split_huge_page_refcount():
> 
> - using the zone LRU lock would work, but has a low granularity and
>   exhibits contention under some of our workloads

I thought about this, but it seems your concern is LRU lock contention.

This patch doesn't include any use case (sometimes that hurts reviewers),
but I expect one is in the idle tracking patch set.
However, we haven't yet concluded whether the idle page tracking patchset
is reasonable to merge into mainline. So, I think it's a rather rash idea.
(But I admit [1,2/3] is worth discussing regardless of idle page tracking.)

What I suggest is as follows:

1. Replace naked page->_count accesses with accessor functions
2. page count lock
3. idle page tracking with simple lock(ex, zone->lru_lock)
4. get_first_page_unless_zero to optimize lock overhead.

-- 
Kind regards,
Minchan Kim



* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-05  6:39 ` [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
@ 2011-08-07 14:25   ` Minchan Kim
  2011-08-09 11:04     ` Michel Lespinasse
  2011-08-12 16:58   ` Andrea Arcangeli
  1 sibling, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2011-08-07 14:25 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, Andrew Morton,
	Rik van Riel, Mel Gorman, Johannes Weiner, KOSAKI Motohiro,
	Shaohua Li

On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
> On Thu, Aug 4, 2011 at 2:07 PM, Michel Lespinasse <walken@google.com> wrote:
> > Patch 3 demonstrates my motivation for this patch series: in my pre-THP
> > implementation of idle page tracking, I was able to use get_page_unless_zero
> > in a way that __split_huge_page_refcount made unsafe. Building on top of
> > patch 2, I can make the required operation safe again. If patch 2 was to
> > be rejected, I would like to get suggestions about alternative approaches
> > to implement the get_first_page_unless_zero() operation described here.
> 
> I should add that I am quite worried about the places that use
> get_page_unless_zero (or the page_cache_*_speculative wrappers) today.
> My worrisome scenario would be as follows:
> 
> - thread T finds a pointer to a page P (possibly from a radix tree in
> find_get_page() )
> - page P gets freed by another thread
> - page P gets re-allocated as the tail of a THP page by another thread
> - another thread gets a reference on page P
> - thread T proceeds doing page_cache_get_speculative(P), intending to
> then check that P is really the page it wanted
> - another thread splits up P's compound page;
> __split_huge_page_refcount subtracts T's refcount on P from head(P)'s
> refcount
> - thread T figures out that it didn't get the page it expected, calls
> page_cache_release(P). But it's too late - the refcount for what used
> to be head(P) has already been corrupted (incorrectly decremented).
> 
> Does anything prevent the above ?

I think it's possible and you found a BUG.
Andrea?

> 
> I can see that the page_cache_get_speculative comment in
> include/linux/pagemap.h maps out one way to prevent the issue. If
> thread T continually held an rcu read lock from the time it finds the
> pointer to P until the time it calls get_page_unless_zero on that
> page, AND there was a synchronize_rcu() call somewhere between the
> time a THP page gets allocated and the time __split_huge_page_refcount
> might first get called on that page, then things would be safe.
> However, that does not seem to be true today: I could not find a
> synchronize_rcu() call before __split_huge_page_refcount(), AND there
> are also places (such as deactivate_page() for example) that call
> get_page_unless_zero without being within an rcu read locked section
> (or holding the zone lru lock to provide exclusion against
> __split_huge_page_refcount).

When I made deactivate_page, I honestly didn't consider that.
IMHO, it shouldn't be a problem, as deactivate_page holds a reference
on the page via pagevec_lookup, so the page shouldn't go away under us.
And at the moment, deactivate_page is used only by invalidate_mapping_pages,
which handles only file pages, while THP handles only anon pages.

-- 
Kind regards,
Minchan Kim



* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-07 14:25   ` Minchan Kim
@ 2011-08-09 11:04     ` Michel Lespinasse
  2011-08-09 22:22       ` Minchan Kim
  2011-08-12 15:36       ` Andrea Arcangeli
  0 siblings, 2 replies; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-09 11:04 UTC (permalink / raw)
  To: Minchan Kim, Andrea Arcangeli, Hugh Dickins, linux-mm,
	Andrew Morton, Paul E. McKenney
  Cc: Rik van Riel, Mel Gorman, Johannes Weiner, KOSAKI Motohiro, Shaohua Li

On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
>> On Thu, Aug 4, 2011 at 2:07 PM, Michel Lespinasse <walken@google.com> wrote:
>> > Patch 3 demonstrates my motivation for this patch series: in my pre-THP
>> > implementation of idle page tracking, I was able to use get_page_unless_zero
>> > in a way that __split_huge_page_refcount made unsafe. Building on top of
>> > patch 2, I can make the required operation safe again. If patch 2 was to
>> > be rejected, I would like to get suggestions about alternative approaches
>> > to implement the get_first_page_unless_zero() operation described here.
>>
>> I should add that I am quite worried about the places that use
>> get_page_unless_zero (or the page_cache_*_speculative wrappers) today.
>> My worrisome scenario would be as follows:
>>
>> - thread T finds a pointer to a page P (possibly from a radix tree in
>> find_get_page() )
>> - page P gets freed by another thread
>> - page P gets re-allocated as the tail of a THP page by another thread
>> - another thread gets a reference on page P
>> - thread T proceeds doing page_cache_get_speculative(P), intending to
>> then check that P is really the page it wanted
>> - another thread splits up P's compound page;
>> __split_huge_page_refcount subtracts T's refcount on P from head(P)'s
>> refcount
>> - thread T figures out that it didn't get the page it expected, calls
>> page_cache_release(P). But it's too late - the refcount for what used
>> to be head(P) has already been corrupted (incorrectly decremented).
>>
>> Does anything prevent the above ?
>
> I think it's possbile and you find a BUG.
> Andrea?

At this point I believe this is indeed a bug, though a very unlikely
one to hit. The interval between thread T finding a pointer to page P
and thread T getting a reference on P is typically just a few
instructions, while the time scale necessary for other threads to free
page P, reallocate it as a compound page, and decide to split it into
single pages is just much larger. So, there is no line of code I can
point to that would prevent the race, but it also looks like it would
be very hard to trigger it.

>> I can see that the page_cache_get_speculative comment in
>> include/linux/pagemap.h maps out one way to prevent the issue. If
>> thread T continually held an rcu read lock from the time it finds the
>> pointer to P until the time it calls get_page_unless_zero on that
>> page, AND there was a synchronize_rcu() call somewhere between the
>> time a THP page gets allocated and the time __split_huge_page_refcount
>> might first get called on that page, then things would be safe.
>> However, that does not seem to be true today: I could not find a
>> synchronize_rcu() call before __split_huge_page_refcount(), AND there
>> are also places (such as deactivate_page() for example) that call
>> get_page_unless_zero without being within an rcu read locked section
>> (or holding the zone lru lock to provide exclusion against
>> __split_huge_page_refcount).

Going forward, I can see several possible solutions:
- Use my proposed page count lock in order to avoid the race. One
would have to convert all get_page_unless_zero() sites to use it. I
expect the cost would be low but still measurable.
- Protect all get_page_unless_zero call sites with rcu read lock or
lru lock (page_cache_get_speculative already has it, but there are
others to consider), and add a synchronize_rcu() before splitting huge
pages.
- It'd be sweet if one could somehow record the time a THP page was
created, and wait for at least one RCU grace period *starting from the
recorded THP creation time* before splitting huge pages. In practice,
we would be very unlikely to have to wait since the grace period would
be already expired. However, I don't think RCU currently provides such
a mechanism - Paul, is this something that would seem easy to
implement or not ?
- Do nothing and hope one doesn't hit the race. This is not my
favourite "solution", but OTOH the race seems so hard to hit that it
may be hard to justify expensive solutions to work around it.

> When I make deactivate_page, I didn't consider that honestly.
> IMHO, It shouldn't be a problem as deactive_page hold a reference
> of page by pagevec_lookup so the page shouldn't be gone under us.

Agree - it seems like you are guaranteed to already hold a reference
(but then a straight get_page should be sufficient, right ?)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.



* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-09 11:04     ` Michel Lespinasse
@ 2011-08-09 22:22       ` Minchan Kim
  2011-08-12 22:35         ` Michel Lespinasse
  2011-08-12 15:36       ` Andrea Arcangeli
  1 sibling, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2011-08-09 22:22 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, Andrew Morton,
	Paul E. McKenney, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Tue, Aug 9, 2011 at 8:04 PM, Michel Lespinasse <walken@google.com> wrote:
> On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
>>> On Thu, Aug 4, 2011 at 2:07 PM, Michel Lespinasse <walken@google.com> wrote:
>>> > Patch 3 demonstrates my motivation for this patch series: in my pre-THP
>>> > implementation of idle page tracking, I was able to use get_page_unless_zero
>>> > in a way that __split_huge_page_refcount made unsafe. Building on top of
>>> > patch 2, I can make the required operation safe again. If patch 2 was to
>>> > be rejected, I would like to get suggestions about alternative approaches
>>> > to implement the get_first_page_unless_zero() operation described here.
>>>
>>> I should add that I am quite worried about the places that use
>>> get_page_unless_zero (or the page_cache_*_speculative wrappers) today.
>>> My worrisome scenario would be as follows:
>>>
>>> - thread T finds a pointer to a page P (possibly from a radix tree in
>>> find_get_page() )
>>> - page P gets freed by another thread
>>> - page P gets re-allocated as the tail of a THP page by another thread
>>> - another thread gets a reference on page P
>>> - thread T proceeds doing page_cache_get_speculative(P), intending to
>>> then check that P is really the page it wanted
>>> - another thread splits up P's compound page;
>>> __split_huge_page_refcount subtracts T's refcount on P from head(P)'s
>>> refcount
>>> - thread T figures out that it didn't get the page it expected, calls
>>> page_cache_release(P). But it's too late - the refcount for what used
>>> to be head(P) has already been corrupted (incorrectly decremented).
>>>
>>> Does anything prevent the above ?
>>
>> I think it's possbile and you find a BUG.
>> Andrea?
>
> At this point I believe this is indeed a bug, though a very unlikely
> one to hit. The interval between thread T finding a pointer to page P
> and thread T getting a reference on P is typically just a few
> instructions, while the time scale necessary for other threads to free
> page P, reallocate it as a compound page, and decide to split it into
> single pages is just much larger. So, there is no line of code I can
> point to that would prevent the race, but it also looks like it would
> be very hard to trigger it.

Fair enough.

>
>>> I can see that the page_cache_get_speculative comment in
>>> include/linux/pagemap.h maps out one way to prevent the issue. If
>>> thread T continually held an rcu read lock from the time it finds the
>>> pointer to P until the time it calls get_page_unless_zero on that
>>> page, AND there was a synchronize_rcu() call somewhere between the
>>> time a THP page gets allocated and the time __split_huge_page_refcount
>>> might first get called on that page, then things would be safe.
>>> However, that does not seem to be true today: I could not find a
>>> synchronize_rcu() call before __split_huge_page_refcount(), AND there
>>> are also places (such as deactivate_page() for example) that call
>>> get_page_unless_zero without being within an rcu read locked section
>>> (or holding the zone lru lock to provide exclusion against
>>> __split_huge_page_refcount).
>
> Going forward, I can see several possible solutions:
> - Use my proposed page count lock in order to avoid the race. One
> would have to convert all get_page_unless_zero() sites to use it. I
> expect the cost would be low but still measurable.

It's not necessary to apply it to *all* get_page_unless_zero sites,
because deactivate_page does it on file pages while THP handles only anon
pages. So the race should not be a problem.

> - Protect all get_page_unless_zero call sites with rcu read lock or
> lru lock (page_cache_get_speculative already has it, but there are
> others to consider), and add a synchronize_rcu() before splitting huge
> pages.

I don't think that can be a solution.
If we don't have any lock to protect the write side, page_count could
become unstable again while we peek at page->_count in
__split_huge_page_refcount after calling synchronize_rcu.
Am I missing something?

> - It'd be sweet if one could somehow record the time a THP page was
> created, and wait for at least one RCU grace period *starting from the
> recorded THP creation time* before splitting huge pages. In practice,
> we would be very unlikely to have to wait since the grace period would
> be already expired. However, I don't think RCU currently provides such
> a mechanism - Paul, is this something that would seem easy to
> implement or not ?
> - Do nothing and hope one doesn't hit the race. This is not my
> favourite "solution", but OTOH the race seems so hard to hit that it

Others might hate it. :)

> may be hard to justify expensive solutions to work around it.

We should fix it with a nice idea. :)
At the moment, your proposed approach looks good to me.
But I expect Andrea won't like it, because he always wants to minimize
the changes THP imposes on other code.

>
>> When I make deactivate_page, I didn't consider that honestly.
>> IMHO, It shouldn't be a problem as deactive_page hold a reference
>> of page by pagevec_lookup so the page shouldn't be gone under us.
>
> Agree - it seems like you are guaranteed to already hold a reference
> (but then a straight get_page should be sufficient, right ?)

I think the point is that deactivate_page handles file pages while THP
handles only anon pages.
Thanks.

-- 
Kind regards,
Minchan Kim



* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-09 11:04     ` Michel Lespinasse
  2011-08-09 22:22       ` Minchan Kim
@ 2011-08-12 15:36       ` Andrea Arcangeli
  2011-08-12 16:08         ` SPAM: " Paul E. McKenney
                           ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-08-12 15:36 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Minchan Kim, Hugh Dickins, linux-mm, Andrew Morton,
	Paul E. McKenney, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
> - Use my proposed page count lock in order to avoid the race. One
> would have to convert all get_page_unless_zero() sites to use it. I
> expect the cost would be low but still measurable.

I haven't yet focused on your problem since we talked about it at the MM
summit, but I seem to recall suggesting there to just get to the head
page and always take the lock on it. split_huge_page only works on 2M
aligned pages; the rest you don't care about. Getting to the head page's
compound_lock should always be safe. And that will still scale
incredibly better than taking the lru_lock for the whole zone (which
would also work). It also seems the best way to stop split_huge_page
without having to alter the put_page fast path when it works on head
pages (the only thing that gets into put_page's complex slow path is the
release of tail pages after get_user_pages*, so it would be nice if
the put_page fast path still didn't need to take locks).

> - It'd be sweet if one could somehow record the time a THP page was
> created, and wait for at least one RCU grace period *starting from the
> recorded THP creation time* before splitting huge pages. In practice,
> we would be very unlikely to have to wait since the grace period would
> be already expired. However, I don't think RCU currently provides such
> a mechanism - Paul, is this something that would seem easy to
> implement or not ?

This looks sweet. We could store a quiescent-point generation counter
in page[1].something; if the page has the same generation as the
last RCU quiescent point (vs rcu_read_lock), we synchronize_rcu before
starting split_huge_page. split_huge_page is serialized through the
anon_vma lock, however, so we'd need to release the anon_vma lock,
synchronize_rcu, and retry; this time the page[1].something sequence
counter would be older than the RCU generation counter and it will
proceed (maybe another thread or process will get there first, but
that's ok).

I didn't have better ideas than yours above, but I'll keep thinking.

> > When I make deactivate_page, I didn't consider that honestly.
> > IMHO, It shouldn't be a problem as deactive_page hold a reference
> > of page by pagevec_lookup so the page shouldn't be gone under us.
> 
> Agree - it seems like you are guaranteed to already hold a reference
> (but then a straight get_page should be sufficient, right ?)

I hope this is not an issue because of the fact the page is guaranteed
not to be THP when get_page_unless_zero runs on it.



* SPAM:  Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 15:36       ` Andrea Arcangeli
@ 2011-08-12 16:08         ` Paul E. McKenney
  2011-08-12 16:43           ` Andrea Arcangeli
                             ` (2 more replies)
  2011-08-12 22:50         ` Michel Lespinasse
  2011-08-13  4:11         ` Minchan Kim
  2 siblings, 3 replies; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-12 16:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michel Lespinasse, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 05:36:16PM +0200, Andrea Arcangeli wrote:
> On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
> > - Use my proposed page count lock in order to avoid the race. One
> > would have to convert all get_page_unless_zero() sites to use it. I
> > expect the cost would be low but still measurable.
> 
> I didn't yet focus at your problem after we talked about it at MM
> summit, but I seem to recall I suggested there to just get to the head
> page and always take the lock on it. split_huge_page only works at 2M
> aligned pages, the rest you don't care about. Getting to the head page
> compound_lock should be always safe. And that will still scale
> incredibly better than taking the lru_lock for the whole zone (which
> would also work). And it seems the best way to stop split_huge_page
> without having to alter the put_page fast path when it works on head
> pages (the only thing that gets into put_page complex slow path is the
> release of tail pages after get_user_pages* so it'd be nice if
> put_page fast path still didn't need to take locks).
> 
> > - It'd be sweet if one could somehow record the time a THP page was
> > created, and wait for at least one RCU grace period *starting from the
> > recorded THP creation time* before splitting huge pages. In practice,
> > we would be very unlikely to have to wait since the grace period would
> > be already expired. However, I don't think RCU currently provides such
> > a mechanism - Paul, is this something that would seem easy to
> > implement or not ?

It should not be hard.  I already have an API for rcutorture testing
use, but it is not appropriate for your use because it is unsynchronized.

We need to be careful with what I give you and how you interpret it.
The most effective approach would be for me to give you an API that
filled in a cookie given a pointer to one, then another API that took
pointers to a pair of cookies and returned saying whether or not a
grace period had elapsed.  You would do something like the following:

	rcu_get_gp_cookie(&pagep->rcucookie);
	. . .

	rcu_get_gp_cookie(&autovarcookie);
	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
		synchronize_rcu();

So, how much space do I get for ->rcucookie?  By default, it is a pair
of unsigned longs, but I could live with as small as a single byte if
you didn't mind a high probability of false negatives (me telling you
to do a grace period despite 16 of them having happened in the meantime
due to overflow of a 4-bit field in the byte).

That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.

TINY_RCU will require more thought, as it doesn't bother counting grace
periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
make rcu_cookie_gp_elapsed() always return false.

OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
is able to reliably detect if there are any RCU readers in flight,
and there normally won't be, so synchronize_rcu() is again free in the
common case.  And no, I don't want to count grace periods as this would
increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
is to be tiny, after all.  ;-)

If you need SRCU, you are out of luck until I get my act together and
merge it in with the other RCU implementations, which might be awhile
still.

For TREE_*RCU, the calls to rcu_get_gp_cookie() will cost you a lock
round trip.  I am hoping to be able to use the counters stored in the
rcu_data structure, which means that I would need to disable preemption
and re-enable it.  Or maybe disable and re-enable irqs instead, not yet
sure which.  This might require me to be conservative and make
rcu_cookie_gp_elapsed() return false unless two grace periods have
elapsed.  Things
get a bit tricky -- yes, I could just use the global counters, but that
would mean that rcu_get_gp_cookie() would need to acquire a global lock,
and I suspect that you intend to invoke it too often for that to be
a winning strategy.

Thoughts?  And how many bits do I get for the cookie?

							Thanx, Paul

> This looks sweet. We could store a quiescent points generation counter
> in the page[1].something, if the page has the same generation of the
> last RCU quiescent point (vs rcu_read_lock) we synchronize_rcu before
> starting split_huge_page. split_huge_page is serialized through the
> anon_vma lock however, so we'd need to release the anon_vma lock,
> synchronize_rcu and retry and this time the page[1].something sequence
> counter would be older than the rcu generation counter and it'll
> proceed (maybe another thread or process will get there first but
> that's ok).
> 
> I didn't have better ideas than yours above, but I'll keep thinking.
> 
> > > When I make deactivate_page, I didn't consider that honestly.
> > > IMHO, It shouldn't be a problem as deactive_page hold a reference
> > > of page by pagevec_lookup so the page shouldn't be gone under us.
> > 
> > Agree - it seems like you are guaranteed to already hold a reference
> > (but then a straight get_page should be sufficient, right ?)
> 
> I hope this is not an issue because of the fact the page is guaranteed
> not to be THP when get_page_unless_zero runs on it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:08         ` SPAM: " Paul E. McKenney
@ 2011-08-12 16:43           ` Andrea Arcangeli
  2011-08-12 17:27             ` Paul E. McKenney
  2011-08-13  4:18             ` Minchan Kim
  2011-08-12 16:57           ` Johannes Weiner
  2011-08-12 23:02           ` Michel Lespinasse
  2 siblings, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-08-12 16:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michel Lespinasse, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 09:08:13AM -0700, Paul E. McKenney wrote:
> It should not be hard.  I already have an API for rcutorture testing
> use, but it is not appropriate for your use because it is unsynchronized.

Sounds good.

> We need to be careful with what I give you and how you interpret it.
> The most effective approach would be for me to give you an API that
> filled in a cookie given a pointer to one, then another API that took
> pointers to a pair of cookies and returned saying whether or not a
> grace period had elapsed.  You would do something like the following:

Even a raw number of events is ok, but it will work like a cookie.

> 	rcu_get_gp_cookie(&pagep->rcucookie);
> 	. . .
> 
> 	rcu_get_gp_cookie(&autovarcookie);
> 	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
> 		synchronize_rcu();
> 
> So, how much space do I get for ->rcucookie?  By default, it is a pair
> of unsigned longs, but I could live with as small as a single byte if
> you didn't mind a high probability of false negatives (me telling you
> to do a grace period despite 16 of them having happened in the meantime
> due to overflow of a 4-bit field in the byte).

It could be 2 longs just fine (so it's 64-bit on 32-bit too, and
guarantees no false positives as it'll never overflow within the
lifetime of the hardware); we have tons of free space to use in
page[1-511].* .

I'm currently unsure how the cookie can be allowed to be smaller than
the real counter, though.  I don't see how it is possible.

> That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.
> 
> TINY_RCU will require more thought, as it doesn't bother counting grace
> periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
> make rcu_cookie_gp_elapsed() always return false.

Yes, it'll surely be safe for us: on UP we have no race, and in fact
get_page_unless_zero isn't even called in the speculative lookup on UP.
With the code above you could always return true with TINY_RCU and skip
the synchronize_rcu() call.

> OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
> is able to reliably detect if there are any RCU readers in flight,
> and there normally won't be, so synchronize_rcu() is again free in the
> common case.  And no, I don't want to count grace periods as this would
> increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
> is to be tiny, after all.  ;-)

OK, so it always returns false and synchronize_rcu is always called,
but it will normally do nothing there.

> If you need SRCU, you are out of luck until I get my act together and
> merge it in with the other RCU implementations, which might be awhile
> still.

Good luck with the merge; fortunately we don't need SRCU, we just need
a synchronize_rcu vs rcu_read_lock.

> For TREE_*RCU, the calls to rcu_get_gp_cookie() will cost you a lock
> round trip.  I am hoping to be able to use the counters stored in the
> rcu_data structure, which means that I would need to disable preemption
> and re-enable it.  Or maybe disable and re-enable irqs instead, not yet
> sure which.  This might require me to be conservative and make
> rcu_cookie_gp_elapsed() return false unless two grace periods have
> elapsed.  Things
> get a bit tricky -- yes, I could just use the global counters, but that
> would mean that rcu_get_gp_cookie() would need to acquire a global lock,
> and I suspect that you intend to invoke it too often for that to be
> a winning strategy.

It is invoked at every page allocation; there are some locks taken
there already, but they're per-mm (mm->page_table_lock).  It'd be nice
if we could run it without taking locks.

If we make it a raw unsigned long long, read it in order (first the
lower bits, then the higher bits on 32-bit) and store it in the
opposite direction (first increment the higher part, then increment
the lower part or reset it to 0), can't we avoid all the locks?  Worst
case we get a false positive when we compare.
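The ordering described above can be modeled in user-space C.  This is illustrative only: a kernel version would need smp_wmb()/smp_rmb() between the two halves (omitted here), and the names are made up for the sketch:

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit grace-period counter split into two 32-bit halves so each
 * half can be loaded/stored atomically on a 32-bit machine. */
static volatile uint32_t gp_lo, gp_hi;

/* Writer: on wrap, bump the high word *before* resetting the low word. */
static void gp_advance(void)
{
	if (gp_lo == 0xffffffffu) {
		gp_hi = gp_hi + 1;	/* high part first ... */
		gp_lo = 0;		/* ... then the low part */
	} else {
		gp_lo = gp_lo + 1;
	}
}

/* Reader: sample the low word first, then the high word.  A read that
 * races with a wrap can only combine an old low word with a new high
 * word, i.e. it can only overshoot.  A too-new snapshot makes a later
 * comparison underestimate the elapsed grace periods, so the worst
 * case is a needless synchronize_rcu(), never a missed grace period. */
static uint64_t gp_read(void)
{
	uint32_t lo = gp_lo;
	uint32_t hi = gp_hi;

	return ((uint64_t)hi << 32) | lo;
}
```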

> Thoughts?  And how many bits do I get for the cookie?

As many as you want.

Thanks!
Andrea


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:08         ` SPAM: " Paul E. McKenney
  2011-08-12 16:43           ` Andrea Arcangeli
@ 2011-08-12 16:57           ` Johannes Weiner
  2011-08-12 17:08             ` Andrea Arcangeli
  2011-08-12 17:41             ` Paul E. McKenney
  2011-08-12 23:02           ` Michel Lespinasse
  2 siblings, 2 replies; 34+ messages in thread
From: Johannes Weiner @ 2011-08-12 16:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Arcangeli, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 09:08:13AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 12, 2011 at 05:36:16PM +0200, Andrea Arcangeli wrote:
> > On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
> > > - Use my proposed page count lock in order to avoid the race. One
> > > would have to convert all get_page_unless_zero() sites to use it. I
> > > expect the cost would be low but still measurable.
> > 
> > I didn't yet focus at your problem after we talked about it at MM
> > summit, but I seem to recall I suggested there to just get to the head
> > page and always take the lock on it. split_huge_page only works at 2M
> > aligned pages, the rest you don't care about. Getting to the head page
> > compound_lock should be always safe. And that will still scale
> > incredibly better than taking the lru_lock for the whole zone (which
> > would also work). And it seems the best way to stop split_huge_page
> > without having to alter the put_page fast path when it works on head
> > pages (the only thing that gets into put_page complex slow path is the
> > release of tail pages after get_user_pages* so it'd be nice if
> > put_page fast path still didn't need to take locks).
> > 
> > > - It'd be sweet if one could somehow record the time a THP page was
> > > created, and wait for at least one RCU grace period *starting from the
> > > recorded THP creation time* before splitting huge pages. In practice,
> > > we would be very unlikely to have to wait since the grace period would
> > > be already expired. However, I don't think RCU currently provides such
> > > a mechanism - Paul, is this something that would seem easy to
> > > implement or not ?
> 
> It should not be hard.  I already have an API for rcutorture testing
> use, but it is not appropriate for your use because it is unsynchronized.
> 
> We need to be careful with what I give you and how you interpret it.
> The most effective approach would be for me to give you an API that
> filled in a cookie given a pointer to one, then another API that took
> pointers to a pair of cookies and returned saying whether or not a
> grace period had elapsed.  You would do something like the following:
> 
> 	rcu_get_gp_cookie(&pagep->rcucookie);
> 	. . .
> 
> 	rcu_get_gp_cookie(&autovarcookie);
> 	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
> 		synchronize_rcu();
> 
> So, how much space do I get for ->rcucookie?  By default, it is a pair
> of unsigned longs, but I could live with as small as a single byte if
> you didn't mind a high probability of false negatives (me telling you
> to do a grace period despite 16 of them having happened in the meantime
> due to overflow of a 4-bit field in the byte).
> 
> That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.
> 
> TINY_RCU will require more thought, as it doesn't bother counting grace
> periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
> make rcu_cookie_gp_elapsed() always return false.
> 
> OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
> is able to reliably detect if there are any RCU readers in flight,
> and there normally won't be, so synchronize_rcu() is again free in the
> common case.  And no, I don't want to count grace periods as this would
> increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
> is to be tiny, after all.  ;-)

I understand you want to be careful with the promises you make in the
API.  How about not even exposing the check for whether a grace period
elapsed, but instead provide a specialized synchronize_rcu()?

Something like

	void synchronize_rcu_with(rcu_time_t time)

that only promises all readers from the specified time are finished.

[ And synchronize_rcu() would be equivalent to
  synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
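A toy model of the proposed pair of primitives, with a stand-in synchronize_rcu() so the fast path is visible; everything here (rcu_current_time, synchronize_rcu_with, the full-width counter) is the hypothetical API under discussion, not an existing kernel interface:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t generation_counter;	/* current grace-period generation */
static int rcu_blocked;			/* did we actually have to wait? */

static void synchronize_rcu(void)	/* stand-in for the real primitive */
{
	rcu_blocked = 1;
}

static uint64_t rcu_current_time(void)
{
	return generation_counter;
}

/* Wait only if no grace period has completed since @time was sampled;
 * with a full-width counter the equality test never wraps in practice. */
static void synchronize_rcu_with(uint64_t time)
{
	if (generation_counter == time)
		synchronize_rcu();
}
```

synchronize_rcu_with(rcu_current_time()) then degenerates to a plain synchronize_rcu(), since no grace period can have completed since a timestamp taken on the same line.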

Then you wouldn't need to worry about how the return value of
rcu_cookie_gp_elapsed() might be interpreted, could freely implement
it equal to synchronize_rcu() on TINY_RCU, the false positives with
small cookies would not be about correctness but merely performance.

And it should still be all that which the THP case requires.

Would that work?


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-05  6:39 ` [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
  2011-08-07 14:25   ` Minchan Kim
@ 2011-08-12 16:58   ` Andrea Arcangeli
  1 sibling, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-08-12 16:58 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Hugh Dickins, linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, KOSAKI Motohiro, Shaohua Li

On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
> On Thu, Aug 4, 2011 at 2:07 PM, Michel Lespinasse <walken@google.com> wrote:
> > Patch 3 demonstrates my motivation for this patch series: in my pre-THP
> > implementation of idle page tracking, I was able to use get_page_unless_zero
> > in a way that __split_huge_page_refcount made unsafe. Building on top of
> > patch 2, I can make the required operation safe again. If patch 2 was to
> > be rejected, I would like to get suggestions about alternative approaches
> > to implement the get_first_page_unless_zero() operation described here.
> 
> I should add that I am quite worried about the places that use
> get_page_unless_zero (or the page_cache_*_speculative wrappers) today.
> My worrisome scenario would be as follows:
> 
> - thread T finds a pointer to a page P (possibly from a radix tree in
> find_get_page() )
> - page P gets freed by another thread
> - page P gets re-allocated as the tail of a THP page by another thread
> - another thread gets a reference on page P

BTW, this isn't just a regular reference; it can only be
get_user_pages* on a tail page, which further reduces the chance by
some orders of magnitude.  If it was just any reference it'd be more
probable, but the time it takes to allocate the page and then for some
other user to get_user_pages it sounds always slower than the
rcu_read_lock section.  If irqs were disabled it'd be mostly guaranteed
(but even if irqs were disabled I would totally agree with fixing this;
I would never like to depend on timings to be 100% safe).  But in this
case irqs are enabled, so a long irq may allow it to happen; we never
know.

> - thread T proceeds doing page_cache_get_speculative(P), intending to
> then check that P is really the page it wanted
> - another thread splits up P's compound page;
> __split_huge_page_refcount subtracts T's refcount on P from head(P)'s
> refcount
> - thread T figures out that it didn't get the page it expected, calls
> page_cache_release(P). But it's too late - the refcount for what used
> to be head(P) has already been corrupted (incorrectly decremented).

And great spotting indeed :)

Thanks,
Andrea


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:57           ` Johannes Weiner
@ 2011-08-12 17:08             ` Andrea Arcangeli
  2011-08-12 17:52               ` Johannes Weiner
  2011-08-12 18:03               ` Paul E. McKenney
  2011-08-12 17:41             ` Paul E. McKenney
  1 sibling, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-08-12 17:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Paul E. McKenney, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> I understand you want to be careful with the promises you make in the
> API.  How about not even exposing the check for whether a grace period
> elapsed, but instead provide a specialized synchronize_rcu()?
> 
> Something like
> 
> 	void synchronize_rcu_with(rcu_time_t time)
> 
> that only promises all readers from the specified time are finished.
> 
> [ And synchronize_rcu() would be equivalent to
>   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> 
> Then you wouldn't need to worry about how the return value of
> rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> it equal to synchronize_rcu() on TINY_RCU, the false positives with
> small cookies would not be about correctness but merely performance.
> 
> And it should still be all that which the THP case requires.
> 
> Would that work?

rcu_time_t would still be an unsigned long long like I suggested?

About the false positives thing: I fail to see how it's ever possible
to return only false positives and never false negatives when the
cookie and the internal counter are not of the same size (and the
cookie doesn't have enough bits to ever tell whether it overflowed).

I think rcu_generation_t is more appropriate because it's not time but
a generation/sequence counter.

Ideally the comparison check would be just an inline function reading
2 longs in reverse order on 32-bit and comparing them with the stable
value we have in page[1]->something_low and page[1]->something_high,
so skipping an external call sounds better to me, but the above should
also work.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:43           ` Andrea Arcangeli
@ 2011-08-12 17:27             ` Paul E. McKenney
  2011-08-12 23:45               ` Michel Lespinasse
  2011-08-13  4:18             ` Minchan Kim
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-12 17:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michel Lespinasse, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 06:43:25PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 12, 2011 at 09:08:13AM -0700, Paul E. McKenney wrote:
> > It should not be hard.  I already have an API for rcutorture testing
> > use, but it is not appropriate for your use because it is unsynchronized.
> 
> Sounds good.
> 
> > We need to be careful with what I give you and how you interpret it.
> > The most effective approach would be for me to give you an API that
> > filled in a cookie given a pointer to one, then another API that took
> > pointers to a pair of cookies and returned saying whether or not a
> > grace period had elapsed.  You would do something like the following:
> 
> Even a raw number of events is ok, but it will work like a cookie.

Cookie would be best.  ;-)

> > 	rcu_get_gp_cookie(&pagep->rcucookie);
> > 	. . .
> > 
> > 	rcu_get_gp_cookie(&autovarcookie);
> > 	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
> > 		synchronize_rcu();
> > 
> > So, how much space do I get for ->rcucookie?  By default, it is a pair
> > of unsigned longs, but I could live with as small as a single byte if
> > you didn't mind a high probability of false negatives (me telling you
> > to do a grace period despite 16 of them having happened in the meantime
> > due to overflow of a 4-bit field in the byte).
> 
> It could be 2 longs just fine (so it's 64bit on 32bit too and guarantees
> no false positive as it'll never overflow for the lifetime of the
> hardware), we've tons of free space to use in page[1-511].* .

Very good!!!  

> I'm currently unsure how the cookie can be allowed to be smaller than
> the real counter though. I don't see how is it possible.

Just put the low-order bits of the counter in the byte.  This could
cause rcu_cookie_gp_elapsed() to get confused, but the only possible
confusion would be for it to make you do synchronize_rcu() when it
wasn't necessary to do so.  This is a performance problem rather than
a correctness problem, though.

But the more bits you give me, the lower the probability of needless
calls to synchronize_rcu().

> > That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.
> > 
> > TINY_RCU will require more thought, as it doesn't bother counting grace
> > periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
> > make rcu_cookie_gp_elapsed() always return false.
> 
> Yes it'll surely be safe for us, on UP we have no race and in fact
> get_page_unless_zero isn't even called in the speculative lookup in UP. With
> the code above you could return always true with TINY_RCU and skip the
> call.

Or maybe I make rcu_cookie_gp_elapsed() take only one cookie and
compare it to the current cookie.  This would save a bit of code in
the TINY cases:

	rcu_get_gp_cookie(&pagep->rcucookie);
	. . .

	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie))
		synchronize_rcu();

The compiler should then be able to recognize synchronize_rcu() as dead
code in the TINY case.

The main downside of this approach is that you couldn't check for two
past points having a grace period between them, but I don't see a use
case for this right offhand.

> > OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
> > is able to reliably detect if there are any RCU readers in flight,
> > and there normally won't be, so synchronize_rcu() is again free in the
> > common case.  And no, I don't want to count grace periods as this would
> > increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
> > is to be tiny, after all.  ;-)
> 
> Ok so it returns always false, and synchronize_rcu is always called,
> but it will normally do nothing there.

Sounds good.

> > If you need SRCU, you are out of luck until I get my act together and
> > merge it in with the other RCU implementations, which might be awhile
> > still.
> 
> Good luck because we don't need SRCU, we just need a synchronize_rcu
> vs rcu_read_lock.

Whew!  ;-)

> > For TREE_*RCU, the calls to rcu_get_gp_cookie() will cost you a lock
> > round trip.  I am hoping to be able to use the counters stored in the
> > rcu_data structure, which means that I would need to disable preemption
> > and re-enable it.  Or maybe disable and re-enable irqs instead, not yet
> > sure which.  This might require me to be conservative and make
> > rcu_cookie_gp_elapsed() return false unless two grace periods have
> > elapsed.  Things
> > get a bit tricky -- yes, I could just use the global counters, but that
> > would mean that rcu_get_gp_cookie() would need to acquire a global lock,
> > and I suspect that you intend to invoke it too often for that to be
> > a winning strategy.
> 
> It is invoked at every page allocation; there are some locks taken
> there already, but they're per-mm (mm->page_table_lock).  It'd be nice
> if we could run it without taking locks.

OK, so I should try hard to make rcu_get_gp_cookie() access the per-CPU
rcu_data structure, then.

How long would there normally be between recording the cookie and
checking for the need for a grace period?  One disk access?  One HZ?
Something else?

> If we make it a raw unsigned long long we read it in order (first lower
> bits, then higher bits on 32bit) and store it in the opposite
> direction (first increment the higher part, then increment the lower
> part or reset it to 0), can't we avoid all the locks and worst case we
> get a false positive when we compare?

If I can make use of the values in the per-CPU rcu_data structure, no
locks are required.  Might need to disable preemption and/or interrupts,
but nothing beyond that.

> > Thoughts?  And how many bits do I get for the cookie?
> 
> As many as you want.

Woo-hoo!!!  ;-)

							Thanx, Paul


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:57           ` Johannes Weiner
  2011-08-12 17:08             ` Andrea Arcangeli
@ 2011-08-12 17:41             ` Paul E. McKenney
  2011-08-12 17:56               ` Johannes Weiner
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-12 17:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> On Fri, Aug 12, 2011 at 09:08:13AM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 12, 2011 at 05:36:16PM +0200, Andrea Arcangeli wrote:
> > > On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
> > > > - Use my proposed page count lock in order to avoid the race. One
> > > > would have to convert all get_page_unless_zero() sites to use it. I
> > > > expect the cost would be low but still measurable.
> > > 
> > > I didn't yet focus at your problem after we talked about it at MM
> > > summit, but I seem to recall I suggested there to just get to the head
> > > page and always take the lock on it. split_huge_page only works at 2M
> > > aligned pages, the rest you don't care about. Getting to the head page
> > > compound_lock should be always safe. And that will still scale
> > > incredibly better than taking the lru_lock for the whole zone (which
> > > would also work). And it seems the best way to stop split_huge_page
> > > without having to alter the put_page fast path when it works on head
> > > pages (the only thing that gets into put_page complex slow path is the
> > > release of tail pages after get_user_pages* so it'd be nice if
> > > put_page fast path still didn't need to take locks).
> > > 
> > > > - It'd be sweet if one could somehow record the time a THP page was
> > > > created, and wait for at least one RCU grace period *starting from the
> > > > recorded THP creation time* before splitting huge pages. In practice,
> > > > we would be very unlikely to have to wait since the grace period would
> > > > be already expired. However, I don't think RCU currently provides such
> > > > a mechanism - Paul, is this something that would seem easy to
> > > > implement or not ?
> > 
> > It should not be hard.  I already have an API for rcutorture testing
> > use, but it is not appropriate for your use because it is unsynchronized.
> > 
> > We need to be careful with what I give you and how you interpret it.
> > The most effective approach would be for me to give you an API that
> > filled in a cookie given a pointer to one, then another API that took
> > pointers to a pair of cookies and returned saying whether or not a
> > grace period had elapsed.  You would do something like the following:
> > 
> > 	rcu_get_gp_cookie(&pagep->rcucookie);
> > 	. . .
> > 
> > 	rcu_get_gp_cookie(&autovarcookie);
> > 	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
> > 		synchronize_rcu();
> > 
> > So, how much space do I get for ->rcucookie?  By default, it is a pair
> > of unsigned longs, but I could live with as small as a single byte if
> > you didn't mind a high probability of false negatives (me telling you
> > to do a grace period despite 16 of them having happened in the meantime
> > due to overflow of a 4-bit field in the byte).
> > 
> > That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.
> > 
> > TINY_RCU will require more thought, as it doesn't bother counting grace
> > periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
> > make rcu_cookie_gp_elapsed() always return false.
> > 
> > OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
> > is able to reliably detect if there are any RCU readers in flight,
> > and there normally won't be, so synchronize_rcu() is again free in the
> > common case.  And no, I don't want to count grace periods as this would
> > increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
> > is to be tiny, after all.  ;-)
> 
> I understand you want to be careful with the promises you make in the
> API.  How about not even exposing the check for whether a grace period
> elapsed, but instead provide a specialized synchronize_rcu()?
> 
> Something like
> 
> 	void synchronize_rcu_with(rcu_time_t time)
> 
> that only promises all readers from the specified time are finished.
> 
> [ And synchronize_rcu() would be equivalent to
>   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> 
> Then you wouldn't need to worry about how the return value of
> rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> it equal to synchronize_rcu() on TINY_RCU, the false positives with
> small cookies would not be about correctness but merely performance.
> 
> And it should still be all that which the THP case requires.
> 
> Would that work?

I currently don't record the times at which past grace periods start
and finish, but you can think of the cookie I was proposing as being a
specialized timestamp that measures the passage of time in terms of the
number of grace periods that have started and finished.  ;-)

							Thanx, Paul


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 17:08             ` Andrea Arcangeli
@ 2011-08-12 17:52               ` Johannes Weiner
  2011-08-12 18:13                 ` Paul E. McKenney
  2011-08-12 22:22                 ` Andrea Arcangeli
  2011-08-12 18:03               ` Paul E. McKenney
  1 sibling, 2 replies; 34+ messages in thread
From: Johannes Weiner @ 2011-08-12 17:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paul E. McKenney, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 07:08:23PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> > I understand you want to be careful with the promises you make in the
> > API.  How about not even exposing the check for whether a grace period
> > elapsed, but instead provide a specialized synchronize_rcu()?
> > 
> > Something like
> > 
> > 	void synchronize_rcu_with(rcu_time_t time)
> > 
> > that only promises all readers from the specified time are finished.
> > 
> > [ And synchronize_rcu() would be equivalent to
> >   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> > 
> > Then you wouldn't need to worry about how the return value of
> > rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> > it equal to synchronize_rcu() on TINY_RCU, the false positives with
> > small cookies would not be about correctness but merely performance.
> > 
> > And it should still be all that which the THP case requires.
> > 
> > Would that work?
> 
> rcu_time_t would still be an unsigned long long like I suggested?

Do we even need to make this fixed?  It can be unsigned long long for
now, but I could imagine leaving it up to the user, depending on how
much space she is able/willing to invest to save time:

	void synchronize_rcu_with(unsigned long time, unsigned int bits)
	{
		if ((generation_counter & ((1UL << bits) - 1)) == time)
			synchronize_rcu();
	}

If you have only 3 bits to store the time, you will synchronize
needlessly every 8th phase.  Better than nothing, right?
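As a runnable model (note the mask must be parenthesized, since == binds tighter than & in C; generation_counter and the stand-in synchronize_rcu() are illustrative, not kernel code):

```c
#include <assert.h>

static unsigned long generation_counter;	/* grace-period generation */
static int did_sync;				/* records a forced wait */

static void synchronize_rcu(void)		/* stand-in */
{
	did_sync = 1;
}

/* @time holds only the low @bits bits of the counter sampled earlier;
 * if the masked current counter has come back around to it, we cannot
 * prove a grace period elapsed and must synchronize. */
static void synchronize_rcu_with(unsigned long time, unsigned int bits)
{
	if ((generation_counter & ((1UL << bits) - 1)) == time)
		synchronize_rcu();
}
```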

> About the false positives thing, I failed to see how it's ever
> possible to return only false positives and never false negatives when
> cookie and internal counter are not of the same size (and the cookie
> doesn't have enough bits to ever tell whether it overflowed).

I don't see how.  Even with one bit for the time stamp you get every
second generation right :-)

> I think rcu_generation_t is more appropriate because it's not time but
> a generation/sequence counter.

I intentionally chose a vague name as the unit should be irrelevant to
the outside world.  But I don't feel strongly about this.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 17:41             ` Paul E. McKenney
@ 2011-08-12 17:56               ` Johannes Weiner
  0 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2011-08-12 17:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Arcangeli, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 10:41:36AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> > On Fri, Aug 12, 2011 at 09:08:13AM -0700, Paul E. McKenney wrote:
> > > On Fri, Aug 12, 2011 at 05:36:16PM +0200, Andrea Arcangeli wrote:
> > > > On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
> > > > > - Use my proposed page count lock in order to avoid the race. One
> > > > > would have to convert all get_page_unless_zero() sites to use it. I
> > > > > expect the cost would be low but still measurable.
> > > > 
> > > > I didn't yet focus at your problem after we talked about it at MM
> > > > summit, but I seem to recall I suggested there to just get to the head
> > > > page and always take the lock on it. split_huge_page only works at 2M
> > > > aligned pages, the rest you don't care about. Getting to the head page
> > > > compound_lock should be always safe. And that will still scale
> > > > incredibly better than taking the lru_lock for the whole zone (which
> > > > would also work). And it seems the best way to stop split_huge_page
> > > > without having to alter the put_page fast path when it works on head
> > > > pages (the only thing that gets into put_page complex slow path is the
> > > > release of tail pages after get_user_pages* so it'd be nice if
> > > > put_page fast path still didn't need to take locks).
> > > > 
> > > > > - It'd be sweet if one could somehow record the time a THP page was
> > > > > created, and wait for at least one RCU grace period *starting from the
> > > > > recorded THP creation time* before splitting huge pages. In practice,
> > > > > we would be very unlikely to have to wait since the grace period would
> > > > > be already expired. However, I don't think RCU currently provides such
> > > > > a mechanism - Paul, is this something that would seem easy to
> > > > > implement or not ?
> > > 
> > > It should not be hard.  I already have an API for rcutorture testing
> > > use, but it is not appropriate for your use because it is unsynchronized.
> > > 
> > > We need to be careful with what I give you and how you interpret it.
> > > The most effective approach would be for me to give you an API that
> > > filled in a cookie given a pointer to one, then another API that took
> > > pointers to a pair of cookies and returned saying whether or not a
> > > grace period had elapsed.  You would do something like the following:
> > > 
> > > 	rcu_get_gp_cookie(&pagep->rcucookie);
> > > 	. . .
> > > 
> > > 	rcu_get_gp_cookie(&autovarcookie);
> > > 	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
> > > 		synchronize_rcu();
> > > 
> > > So, how much space do I get for ->rcucookie?  By default, it is a pair
> > > of unsigned longs, but I could live with as small as a single byte if
> > > you didn't mind a high probability of false negatives (me telling you
> > > to do a grace period despite 16 of them having happened in the meantime
> > > due to overflow of a 4-bit field in the byte).
> > > 
> > > That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.
> > > 
> > > TINY_RCU will require more thought, as it doesn't bother counting grace
> > > periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
> > > make rcu_cookie_gp_elapsed() always return false.
> > > 
> > > OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
> > > is able to reliably detect if there are any RCU readers in flight,
> > > and there normally won't be, so synchronize_rcu() is again free in the
> > > common case.  And no, I don't want to count grace periods as this would
> > > increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
> > > is to be tiny, after all.  ;-)
> > 
> > I understand you want to be careful with the promises you make in the
> > API.  How about not even exposing the check for whether a grace period
> > elapsed, but instead provide a specialized synchronize_rcu()?
> > 
> > Something like
> > 
> > 	void synchronize_rcu_with(rcu_time_t time)
> > 
> > that only promises all readers from the specified time are finished.
> > 
> > [ And synchronize_rcu() would be equivalent to
> >   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> > 
> > Then you wouldn't need to worry about how the return value of
> > rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> > it equal to synchronize_rcu() on TINY_RCU, the false positives with
> > small cookies would not be about correctness but merely performance.
> > 
> > And it should still be all that which the THP case requires.
> > 
> > Would that work?
> 
> I currently don't record the times at which past grace periods start
> and finish, but you can think of the cookie I was proposing as being a
> specialized timestamp that measures the passage of time in terms of the
> number of grace periods that have started and finished.  ;-)

Oh, absolutely, that is what I meant.  Sorry if it wasn't clear.

My proposal was more about the function interface, not the unit of the
cookies that are passed around.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 17:08             ` Andrea Arcangeli
  2011-08-12 17:52               ` Johannes Weiner
@ 2011-08-12 18:03               ` Paul E. McKenney
  1 sibling, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-12 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 07:08:23PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> > I understand you want to be careful with the promises you make in the
> > API.  How about not even exposing the check for whether a grace period
> > elapsed, but instead provide a specialized synchronize_rcu()?
> > 
> > Something like
> > 
> > 	void synchronize_rcu_with(rcu_time_t time)
> > 
> > that only promises all readers from the specified time are finished.
> > 
> > [ And synchronize_rcu() would be equivalent to
> >   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> > 
> > Then you wouldn't need to worry about how the return value of
> > rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> > it equal to synchronize_rcu() on TINY_RCU, the false positives with
> > small cookies would not be about correctness but merely performance.
> > 
> > And it should still be all that which the THP case requires.
> > 
> > Would that work?
> 
> rcu_time_t would still be an unsigned long long like I suggested?

NACK.  I really really really don't want to have to deal with time
synchronization issues.  ;-)

"Yes, your memory was corrupted due to the CPU's clocks going slightly
out of synchronization."  Thank you, but no thank you!

> About the false positives thing, I failed to see how it's ever
> possible to return only false positives and never false negatives when
> cookie and internal counter are not of the same size (and the cookie
> doesn't have enough bits to ever tell whether it overflowed).

Easy.  I say "yes" if I can prove based on the counter values that an RCU
grace period elapsed, and "no" if there is any uncertainty.  Any overflow
means lots of RCU grace periods, and so if the counters haven't changed
enough, I just say "no".  I might be saying "no" wrongly on overflow,
but that is the safe mistake to make.

> I think rcu_generation_t is more appropriate because it's not time but
> a generation/sequence counter.

Sounds like a better name than the rcu_cookie_t I was thinking of,
so happy to steal your naming idea.  ;-)

> Ideally, the comparison check would be just an inline function
> reading 2 longs in reverse order on 32bit and comparing them with the
> stable value we have in page[1]->something_low and
> page[1]->something_high, so skipping an external call sounds better
> to me, but the above should also work.

The current #include structure forces an external call for
TREE_RCU and TREE_PREEMPT_RCU, but TINY_RCU and TINY_PREEMPT_RCU
can have an inlined call.

Here are my current thoughts on API:

	void rcu_get_gp_cookie(rcu_generation_t *rgp);

		Fills in the pointed-to RCU generation information.
		The information is opaque, and is currently a pair
		of unsigned longs (but could change in the future,
		for example, it might turn out that only one is
		required, etc.).

	bool rcu_cookie_gp_elapsed(rcu_generation_t *rgp);

		Given a pointer to RCU generation information previously
		filled in, returns true if it can prove that at least
		one RCU grace period has elapsed since then.

Seem reasonable?
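
A toy model of how these two calls could fit together; the field
layout, the stand-in grace-period counter, and the "two completions"
rule below are all assumptions for illustration, not Paul's actual
implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Opaque generation information; here one counter stands in for the
 * pair of unsigned longs described above. */
typedef struct {
	unsigned long snap;	/* completed grace periods at snapshot */
} rcu_generation_t;

static unsigned long completed_gps;	/* stand-in for RCU's own state */

static void rcu_get_gp_cookie(rcu_generation_t *rgp)
{
	rgp->snap = completed_gps;
}

/*
 * True only if we can prove a full grace period began and ended after
 * the cookie was taken.  A grace period may already have been in
 * flight at snapshot time, so a single completion is not proof;
 * demanding two completions answers "no" on any uncertainty, the safe
 * mistake to make.
 */
static bool rcu_cookie_gp_elapsed(rcu_generation_t *rgp)
{
	return completed_gps - rgp->snap >= 2;
}
```

Callers would then fall back to synchronize_rcu() (or call_rcu())
whenever the check returns false, exactly as in the snippets earlier
in the thread.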

							Thanx, Paul


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 17:52               ` Johannes Weiner
@ 2011-08-12 18:13                 ` Paul E. McKenney
  2011-08-12 19:05                   ` Johannes Weiner
  2011-08-12 22:22                 ` Andrea Arcangeli
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-12 18:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 07:52:06PM +0200, Johannes Weiner wrote:
> On Fri, Aug 12, 2011 at 07:08:23PM +0200, Andrea Arcangeli wrote:
> > On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> > > I understand you want to be careful with the promises you make in the
> > > API.  How about not even exposing the check for whether a grace period
> > > elapsed, but instead provide a specialized synchronize_rcu()?
> > > 
> > > Something like
> > > 
> > > 	void synchronize_rcu_with(rcu_time_t time)
> > > 
> > > that only promises all readers from the specified time are finished.
> > > 
> > > [ And synchronize_rcu() would be equivalent to
> > >   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> > > 
> > > Then you wouldn't need to worry about how the return value of
> > > rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> > > it equal to synchronize_rcu() on TINY_RCU, the false positives with
> > > small cookies would not be about correctness but merely performance.
> > > 
> > > And it should still be all that which the THP case requires.
> > > 
> > > Would that work?
> > 
> > rcu_time_t would still be an unsigned long long like I suggested?
> 
> Do we even need to make this fixed?  It can be unsigned long long for
> now, but I could imagine leaving it up to the user depending how much
> space she is able/willing to invest to save time:
> 
> 	void synchronize_rcu_with(unsigned long time, unsigned int bits)
> 	{
> 		if ((generation_counter & ((1UL << bits) - 1)) == time)
> 			synchronize_rcu();
> 	}

This is indeed more convenient for this particular use case, but suppose
that the caller instead wanted to use call_rcu()?  The API I am currently
proposing allows either synchronize_rcu() or call_rcu() to be used.  In
addition, it allows alternative algorithms, for example:

	rcu_get_gp_cookie(&wherever);

	...

	if (rcu_cookie_gp_elapsed(&wherever))
		p = old_pointer;  /* now safe to re-use. */
	else
		p = kmalloc( ... );  /* can't re-use, so get new memory. */

> If you have only 3 bits to store the time, you will synchronize
> falsely every 8th phase.  Better than nothing, right?

;-)

> > About the false positives thing, I failed to see how it's ever
> > possible to return only false positives and never false negatives when
> > cookie and internal counter are not of the same size (and the cookie
> > doesn't have enough bits to ever tell whether it overflowed).
> 
> I don't see how.  Even with one bit for the time stamp you get every
> second generation right :-)

I probably need at least two or three bits to account for grace-period
slew, at least if we want to avoid grabbing a global lock each time
one of these APIs is invoked.

> > I think rcu_generation_t is more appropriate because it's not time but
> > a generation/sequence counter.
> 
> I intentionally chose a vague name as the unit should be irrelevant to
> the outside world.  But I don't feel strongly about this.

Yep, different RCU implementations will need different data in the
rcu_generation_t.

							Thanx, Paul


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 18:13                 ` Paul E. McKenney
@ 2011-08-12 19:05                   ` Johannes Weiner
  2011-08-12 22:14                     ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Johannes Weiner @ 2011-08-12 19:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Arcangeli, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 11:13:06AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 12, 2011 at 07:52:06PM +0200, Johannes Weiner wrote:
> > On Fri, Aug 12, 2011 at 07:08:23PM +0200, Andrea Arcangeli wrote:
> > > On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> > > > I understand you want to be careful with the promises you make in the
> > > > API.  How about not even exposing the check for whether a grace period
> > > > elapsed, but instead provide a specialized synchronize_rcu()?
> > > > 
> > > > Something like
> > > > 
> > > > 	void synchronize_rcu_with(rcu_time_t time)
> > > > 
> > > > that only promises all readers from the specified time are finished.
> > > > 
> > > > [ And synchronize_rcu() would be equivalent to
> > > >   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> > > > 
> > > > Then you wouldn't need to worry about how the return value of
> > > > rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> > > > it equal to synchronize_rcu() on TINY_RCU, the false positives with
> > > > small cookies would not be about correctness but merely performance.
> > > > 
> > > > And it should still be all that which the THP case requires.
> > > > 
> > > > Would that work?
> > > 
> > > rcu_time_t would still be an unsigned long long like I suggested?
> > 
> > Do we even need to make this fixed?  It can be unsigned long long for
> > now, but I could imagine leaving it up to the user depending how much
> > space she is able/willing to invest to save time:
> > 
> > 	void synchronize_rcu_with(unsigned long time, unsigned int bits)
> > 	{
> > 		if ((generation_counter & ((1UL << bits) - 1)) == time)
> > 			synchronize_rcu();
> > 	}
> 
> This is indeed more convenient for this particular use case, but suppose
> that the caller instead wanted to use call_rcu()?

I don't quite understand.  call_rcu() will always schedule the
callbacks for execution after a grace period.  So the only use case I
can see--executing the callback ASAP as the required grace period has
already elapsed--would still require an extra argument to call_rcu()
for it to properly schedule the callback, no?  I.e.

	call_rcu_after(head, func, generation)

What am I missing that would make the existing call_rcu() useful in
combination with rcu_cookie_gp_elapsed()?

> The API I am currently proposing allows either synchronize_rcu() or
> call_rcu() to be used.  In addition, it allows alternative
> algorithms, for example:
> 
> 	rcu_get_gp_cookie(&wherever);
> 
> 	...
> 
> 	if (rcu_cookie_gp_elapsed(&wherever))
> 		p = old_pointer;  /* now safe to re-use. */
> 	else
> 		p = kmalloc( ... );  /* can't re-use, so get new memory. */

I have to admit that I am not imaginative enough right now to put this
in a real life scenario.  But it does look more flexible.

Though it must be made clear that it might never return true, so
anything essential (like _freeing_ old memory) must never rely on it.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 19:05                   ` Johannes Weiner
@ 2011-08-12 22:14                     ` Paul E. McKenney
  0 siblings, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-12 22:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 09:05:57PM +0200, Johannes Weiner wrote:
> On Fri, Aug 12, 2011 at 11:13:06AM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 12, 2011 at 07:52:06PM +0200, Johannes Weiner wrote:
> > > On Fri, Aug 12, 2011 at 07:08:23PM +0200, Andrea Arcangeli wrote:
> > > > On Fri, Aug 12, 2011 at 06:57:49PM +0200, Johannes Weiner wrote:
> > > > > I understand you want to be careful with the promises you make in the
> > > > > API.  How about not even exposing the check for whether a grace period
> > > > > elapsed, but instead provide a specialized synchronize_rcu()?
> > > > > 
> > > > > Something like
> > > > > 
> > > > > 	void synchronize_rcu_with(rcu_time_t time)
> > > > > 
> > > > > that only promises all readers from the specified time are finished.
> > > > > 
> > > > > [ And synchronize_rcu() would be equivalent to
> > > > >   synchronize_rcu_with(rcu_current_time()) if I am not mistaken. ]
> > > > > 
> > > > > Then you wouldn't need to worry about how the return value of
> > > > > rcu_cookie_gp_elapsed() might be interpreted, could freely implement
> > > > > it equal to synchronize_rcu() on TINY_RCU, the false positives with
> > > > > small cookies would not be about correctness but merely performance.
> > > > > 
> > > > > And it should still be all that which the THP case requires.
> > > > > 
> > > > > Would that work?
> > > > 
> > > > rcu_time_t would still be an unsigned long long like I suggested?
> > > 
> > > Do we even need to make this fixed?  It can be unsigned long long for
> > > now, but I could imagine leaving it up to the user depending how much
> > > space she is able/willing to invest to save time:
> > > 
> > > 	void synchronize_rcu_with(unsigned long time, unsigned int bits)
> > > 	{
> > > 		if ((generation_counter & ((1UL << bits) - 1)) == time)
> > > 			synchronize_rcu();
> > > 	}
> > 
> > This is indeed more convenient for this particular use case, but suppose
> > that the caller instead wanted to use call_rcu()?
> 
> I don't quite understand.  call_rcu() will always schedule the
> callbacks for execution after a grace period.  So the only use case I
> can see--executing the callback ASAP as the required grace period has
> already elapsed--would still require an extra argument to call_rcu()
> for it to properly schedule the callback, no?  I.e.
> 
> 	call_rcu_after(head, func, generation)
> 
> What am I missing that would make the existing call_rcu() useful in
> combination with rcu_cookie_gp_elapsed()?

I was thinking of something like the following:

	rcu_get_gp_cookie(&wherever);

	...

	if (!rcu_cookie_gp_elapsed(&wherever))
		call_rcu(&p->rcu, my_callback);
	else
		my_callback(&p->rcu);

> > The API I am currently proposing allows either synchronize_rcu() or
> > call_rcu() to be used.  In addition, it allows alternative
> > algorithms, for example:
> > 
> > 	rcu_get_gp_cookie(&wherever);
> > 
> > 	...
> > 
> > 	if (rcu_cookie_gp_elapsed(&wherever))
> > 		p = old_pointer;  /* now safe to re-use. */
> > 	else
> > 		p = kmalloc( ... );  /* can't re-use, so get new memory. */
> 
> I have to admit that I am not imaginative enough right now to put this
> in a real life scenario.  But it does look more flexible.
> 
> Though it must be made clear that it may never return true, so
> anything essential (like _freeing_ old memory) may never rely on it.

Good point!  And even if it only returned false sometimes, one needs
to avoid leaking the memory referenced by old_pointer.  Which should
hopefully take care of the case where it always returns false.

							Thanx, Paul


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 17:52               ` Johannes Weiner
  2011-08-12 18:13                 ` Paul E. McKenney
@ 2011-08-12 22:22                 ` Andrea Arcangeli
  1 sibling, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-08-12 22:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Paul E. McKenney, Michel Lespinasse, Minchan Kim, Hugh Dickins,
	linux-mm, Andrew Morton, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 07:52:06PM +0200, Johannes Weiner wrote:
> Do we even need to make this fixed?  It can be unsigned long long for
> now, but I could imagine leaving it up to the user depending how much
> space she is able/willing to invest to save time:
> 
> 	void synchronize_rcu_with(unsigned long time, unsigned int bits)
> 	{
> 		if ((generation_counter & ((1UL << bits) - 1)) == time)
> 			synchronize_rcu();
> 	}
> 
> If you have only 3 bits to store the time, you will synchronize
> falsely every 8th phase.  Better than nothing, right?

OK, if any of the bits differ, we can safely skip it.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-09 22:22       ` Minchan Kim
@ 2011-08-12 22:35         ` Michel Lespinasse
  2011-08-13  4:07           ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-12 22:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, Andrew Morton,
	Paul E. McKenney, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Tue, Aug 9, 2011 at 3:22 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 8:04 PM, Michel Lespinasse <walken@google.com> wrote:
>> On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>>> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
>>>> I can see that the page_cache_get_speculative comment in
>>>> include/linux/pagemap.h maps out one way to prevent the issue. If
>>>> thread T continually held an rcu read lock from the time it finds the
>>>> pointer to P until the time it calls get_page_unless_zero on that
>>>> page, AND there was a synchronize_rcu() call somewhere between the
>>>> time a THP page gets allocated and the time __split_huge_page_refcount
>>>> might first get called on that page, then things would be safe.
>>>> However, that does not seem to be true today: I could not find a
>>>> synchronize_rcu() call before __split_huge_page_refcount(), AND there
>>>> are also places (such as deactivate_page() for example) that call
>>>> get_page_unless_zero without being within an rcu read locked section
>>>> (or holding the zone lru lock to provide exclusion against
>>>> __split_huge_page_refcount).
>>
>> Going forward, I can see several possible solutions:
>> - Use my proposed page count lock in order to avoid the race. One
>> would have to convert all get_page_unless_zero() sites to use it. I
>> expect the cost would be low but still measurable.
>
> It's not necessary to apply it on *all* get_page_unless_zero sites.
> Because deactivate_page does it on file pages while THP handles only anon pages.
> So the race should not be a problem.

But it doesn't matter what kind of page the get_page_unless_zero call
site hopes to get a reference on - if it doesn't already hold a
reference on the page (either directly as a reference, or if a known
mapping points to that page and the page table lock is taken or
interrupts are disabled in order to guarantee the mapping won't get
yanked), then the page can get yanked and a THP page could show up
there before the call site gets a reference.

>> - Protect all get_page_unless_zero call sites with rcu read lock or
>> lru lock (page_cache_get_speculative already has it, but there are
>> others to consider), and add a synchronize_rcu() before splitting huge
>> pages.
>
> I think it can't be a solution.
> If we don't have any lock to protect the write side, page_count could
> be unstable again while we peek at page->count in
> __split_huge_page_refcount after calling synchronize_rcu.
> Am I missing something?

The tail page count would be unstable for at most one rcu grace period
after the page got allocated. This is guaranteed by making all
get_page_unless_zero call sites make sure they somehow determine the
page is not a THP tail page (for example because they found it in
radix tree) before calling get_page_unless_zero and having an rcu read
lock wrapping these two together. This is basically the protocol
described in the comment for page_cache_get_speculative() in pagemap.h.
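
For reference, the refcount half of that protocol can be modeled in a
few lines; this is a toy stand-in using a plain int, not the kernel's
atomic implementation, and the rcu_read_lock()/lookup/recheck steps
around it are elided:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for struct page, carrying only a reference count. */
struct page {
	int count;
};

/*
 * Take a reference only if the page is not already at refcount zero;
 * a zero count means the page may be freed (or reused, possibly as a
 * THP tail page), so the caller must not touch it.
 */
static bool get_page_unless_zero(struct page *page)
{
	if (page->count == 0)
		return false;
	page->count++;
	return true;
}
```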

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 15:36       ` Andrea Arcangeli
  2011-08-12 16:08         ` SPAM: " Paul E. McKenney
@ 2011-08-12 22:50         ` Michel Lespinasse
  2011-08-13  4:11         ` Minchan Kim
  2 siblings, 0 replies; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-12 22:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Minchan Kim, Hugh Dickins, linux-mm, Andrew Morton,
	Paul E. McKenney, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 8:36 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
>> - Use my proposed page count lock in order to avoid the race. One
>> would have to convert all get_page_unless_zero() sites to use it. I
>> expect the cost would be low but still measurable.
>
> I didn't yet focus at your problem after we talked about it at MM
> summit, but I seem to recall I suggested there to just get to the head
> page and always take the lock on it. split_huge_page only works at 2M
> aligned pages, the rest you don't care about. Getting to the head page
> compound_lock should be always safe. And that will still scale
> incredibly better than taking the lru_lock for the whole zone (which
> would also work). And it seems the best way to stop split_huge_page
> without having to alter the put_page fast path when it works on head
> pages (the only thing that gets into put_page complex slow path is the
> release of tail pages after get_user_pages* so it'd be nice if
> put_page fast path still didn't need to take locks).

We did talk about it. At some point I thought it might work :)

The problem case there is this. Say the page I want to
get_page_unless_zero is a single page, and the page at the prior 2M
aligned boundary is currently free. I can't rely on the desired page
not getting reallocated, because I don't have a reference on it yet.
But I can't make things safe by taking a reference and/or the compound
lock on the aligned page either, because its refcount currently is
zero.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:08         ` SPAM: " Paul E. McKenney
  2011-08-12 16:43           ` Andrea Arcangeli
  2011-08-12 16:57           ` Johannes Weiner
@ 2011-08-12 23:02           ` Michel Lespinasse
  2 siblings, 0 replies; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-12 23:02 UTC (permalink / raw)
  To: paulmck
  Cc: Andrea Arcangeli, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 9:08 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Aug 12, 2011 at 05:36:16PM +0200, Andrea Arcangeli wrote:
>> On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
>> > - It'd be sweet if one could somehow record the time a THP page was
>> > created, and wait for at least one RCU grace period *starting from the
>> > recorded THP creation time* before splitting huge pages. In practice,
>> > we would be very unlikely to have to wait since the grace period would
>> > be already expired. However, I don't think RCU currently provides such
>> > a mechanism - Paul, is this something that would seem easy to
>> > implement or not ?
>
> It should not be hard.  I already have an API for rcutorture testing
> use, but it is not appropriate for your use because it is unsynchronized.

Yay!

> We need to be careful with what I give you and how you interpret it.
> The most effective approach would be for me to give you an API that
> filled in a cookie given a pointer to one, then another API that took
> pointers to a pair of cookies and returned saying whether or not a
> grace period had elapsed.  You would do something like the following:
>
>        rcu_get_gp_cookie(&pagep->rcucookie);
>        . . .
>
>        rcu_get_gp_cookie(&autovarcookie);
>        if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
>                synchronize_rcu();

This would work. The minimal interface I actually need would be:

> So, how much space do I get for ->rcucookie?  By default, it is a pair
> of unsigned longs, but I could live with as small as a single byte if
> you didn't mind a high probability of false negatives (me telling you
> to do a grace period despite 16 of them having happened in the meantime
> due to overflow of a 4-bit field in the byte).

Two longs per cookie would work. We could most easily store them in
(page_head+2)->lru. This assumes THP pages will always be at least
order 2, but I don't think that's a problem.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 17:27             ` Paul E. McKenney
@ 2011-08-12 23:45               ` Michel Lespinasse
  2011-08-13  1:57                 ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Michel Lespinasse @ 2011-08-12 23:45 UTC (permalink / raw)
  To: paulmck
  Cc: Andrea Arcangeli, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 10:27 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> Or maybe I make rcu_cookie_gp_elapsed() take only one cookie and
> compare it to the current cookie.  This would save a bit of code in
> the TINY cases:
>
>        rcu_get_gp_cookie(&pagep->rcucookie);
>        . . .
>
>        if (!rcu_cookie_gp_elapsed(&pagep->rcucookie))
>                synchronize_rcu();

Agreed, this looks nicer than having the second cookie on the stack. As
you said, this does not allow us to compare two past points in time,
but I really don't see a use case for that.

> How long would there normally be between recording the cookie and
> checking for the need for a grace period?  One disk access?  One HZ?
> Something else?

I would expect >>10 seconds in the normal case? I'm not sure how much
lower this may get in adverse workloads. Andrea?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 23:45               ` Michel Lespinasse
@ 2011-08-13  1:57                 ` Paul E. McKenney
  2011-08-13 23:56                   ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2011-08-13  1:57 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 04:45:59PM -0700, Michel Lespinasse wrote:
> On Fri, Aug 12, 2011 at 10:27 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > Or maybe I make rcu_cookie_gp_elapsed() take only one cookie and
> > compare it to the current cookie.  This would save a bit of code in
> > the TINY cases:
> >
> >        rcu_get_gp_cookie(&pagep->rcucookie);
> >        . . .
> >
> >        if (!rcu_cookie_gp_elapsed(&pagep->rcucookie))
> >                synchronize_rcu();
> 
> Agreed, this looks nicer than having the second cookie on the stack. As
> you said, this does not allow us to compare two past points in time,
> but I really don't see a use case for that.

And actually hand-writing the code got me the following API:

struct rcu_cookie;
void rcu_get_gp_cookie(struct rcu_cookie *rcp);
bool rcu_cookie_gp_elapsed(struct rcu_cookie *rcp);

For TREE{_PREEMPT,}_RCU these are both external calls (#include hell
and all that).  For TINY{_PREEMPT,}_RCU they are both trivial inlineable
functions.

> > How long would there normally be between recording the cookie and
> > checking for the need for a grace period?  One disk access?  One HZ?
> > Something else?
> 
> I would expect >>10 seconds in the normal case ? I'm not sure how much
> lower this may get in adverse workloads. Andrea ?

>>10 seconds would be way more than enough to allow this to work well.
But if we are getting much below 100 milliseconds, we need to rethink
this.

							Thanx, Paul


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 22:35         ` Michel Lespinasse
@ 2011-08-13  4:07           ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2011-08-13  4:07 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, Andrew Morton,
	Paul E. McKenney, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 03:35:25PM -0700, Michel Lespinasse wrote:
> On Tue, Aug 9, 2011 at 3:22 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Tue, Aug 9, 2011 at 8:04 PM, Michel Lespinasse <walken@google.com> wrote:
> >> On Sun, Aug 7, 2011 at 7:25 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> >>> On Thu, Aug 04, 2011 at 11:39:19PM -0700, Michel Lespinasse wrote:
> >>>> I can see that the page_cache_get_speculative comment in
> >>>> include/linux/pagemap.h maps out one way to prevent the issue. If
> >>>> thread T continually held an rcu read lock from the time it finds the
> >>>> pointer to P until the time it calls get_page_unless_zero on that
> >>>> page, AND there was a synchronize_rcu() call somewhere between the
> >>>> time a THP page gets allocated and the time __split_huge_page_refcount
> >>>> might first get called on that page, then things would be safe.
> >>>> However, that does not seem to be true today: I could not find a
> >>>> synchronize_rcu() call before __split_huge_page_refcount(), AND there
> >>>> are also places (such as deactivate_page() for example) that call
> >>>> get_page_unless_zero without being within an rcu read locked section
> >>>> (or holding the zone lru lock to provide exclusion against
> >>>> __split_huge_page_refcount).
> >>
> >> Going forward, I can see several possible solutions:
> >> - Use my proposed page count lock in order to avoid the race. One
> >> would have to convert all get_page_unless_zero() sites to use it. I
> >> expect the cost would be low but still measurable.
> >
> > It's not necessary to apply it on *all* get_page_unless_zero sites.
> > Because deactivate_page does it on file pages while THP handles only anon pages.
> > So the race should not be a problem.
> 
> But it doesn't matter what kind of page the get_page_unless_zero call
> site hopes to get a reference on - if it doesn't already hold a
> reference on the page (either directly as a reference, or if a known
> mapping points to that page and the page table lock is taken or
> interrupts are disabled in order to guarantee the mapping won't get
> yanked), then the page can get yanked and a THP page could show up
> there before the call site gets a reference.

As I said, the caller of deactivate_page holds a reference right now, so it
should be okay. But I admit deactivate_page doesn't have to call
get_page_unless_zero; plain get_page is enough as long as the caller makes
sure to hold a reference on the page, as in the current situation.
I will add such a comment to deactivate_page and change get_page_unless_zero
to get_page for easier use in the future.

I hadn't noticed that.
Thanks for giving me the chance to think about it, Michel.

> 
> >> - Protect all get_page_unless_zero call sites with rcu read lock or
> >> lru lock (page_cache_get_speculative already has it, but there are
> >> others to consider), and add a synchronize_rcu() before splitting huge
> >> pages.
> >
> > I think it can't be a solution.
> > If we don't have any lock for protect write-side, page_count could be
> > unstable again while we peek page->count in
> > __split_huge_page_refcount after calling synchronize_rcu.
> > Do I miss something?
> 
> The tail page count would be unstable for at most one rcu grace period
> after the page got allocated. This is guaranteed by making all
> get_page_unless_zero call sites make sure they somehow determine the
> page is not a THP tail page (for example because they found it in
> radix tree) before calling get_page_unless_zero and having an rcu read
> lock wrapping these two together. This is basically the protocol
> described in the comment for page_cache_get_speculative() in pagemap.h.

Absolutely

> 
> -- 
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.

-- 
Kind regards,
Minchan Kim


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 15:36       ` Andrea Arcangeli
  2011-08-12 16:08         ` SPAM: " Paul E. McKenney
  2011-08-12 22:50         ` Michel Lespinasse
@ 2011-08-13  4:11         ` Minchan Kim
  2 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2011-08-13  4:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michel Lespinasse, Hugh Dickins, linux-mm, Andrew Morton,
	Paul E. McKenney, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 05:36:16PM +0200, Andrea Arcangeli wrote:
> On Tue, Aug 09, 2011 at 04:04:21AM -0700, Michel Lespinasse wrote:
> > - Use my proposed page count lock in order to avoid the race. One
> > would have to convert all get_page_unless_zero() sites to use it. I
> > expect the cost would be low but still measurable.
> 
> I didn't yet focus at your problem after we talked about it at MM
> summit, but I seem to recall I suggested there to just get to the head
> page and always take the lock on it. split_huge_page only works at 2M
> aligned pages, the rest you don't care about. Getting to the head page
> compound_lock should be always safe. And that will still scale
> incredibly better than taking the lru_lock for the whole zone (which
> would also work). And it seems the best way to stop split_huge_page
> without having to alter the put_page fast path when it works on head
> pages (the only thing that gets into put_page complex slow path is the
> release of tail pages after get_user_pages* so it'd be nice if
> put_page fast path still didn't need to take locks).
> 
> > - It'd be sweet if one could somehow record the time a THP page was
> > created, and wait for at least one RCU grace period *starting from the
> > recorded THP creation time* before splitting huge pages. In practice,
> > we would be very unlikely to have to wait since the grace period would
> > be already expired. However, I don't think RCU currently provides such
> > a mechanism - Paul, is this something that would seem easy to
> > implement or not ?
> 
> This looks sweet. We could store a quiescent points generation counter
> in the page[1].something, if the page has the same generation of the
> last RCU quiescent point (vs rcu_read_lock) we synchronize_rcu before
> starting split_huge_page. split_huge_page is serialized through the
> anon_vma lock however, so we'd need to release the anon_vma lock,
> synchronize_rcu and retry and this time the page[1].something sequence
> counter would be older than the rcu generation counter and it'll
> proceed (maybe another thread or process will get there first but
> that's ok).
> 
> I didn't have better ideas than yours above, but I'll keep thinking.
> 
> > > When I make deactivate_page, I didn't consider that honestly.
> > > IMHO, It shouldn't be a problem as deactive_page hold a reference
> > > of page by pagevec_lookup so the page shouldn't be gone under us.
> > 
> > Agree - it seems like you are guaranteed to already hold a reference
> > (but then a straight get_page should be sufficient, right ?)
> 
> I hope this is not an issue because of the fact the page is guaranteed
> not to be THP when get_page_unless_zero runs on it.

Yes. At the moment it's not a problem, as the only caller (ie,
invalidate_mapping_pages) holds a reference on the page before calling. But
if another use case appears in the future, its caller will need to keep this
problem in mind. So I will add a comment about that and replace
get_page_unless_zero with get_page to prevent confusion.

-- 
Kind regards,
Minchan Kim


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-12 16:43           ` Andrea Arcangeli
  2011-08-12 17:27             ` Paul E. McKenney
@ 2011-08-13  4:18             ` Minchan Kim
  1 sibling, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2011-08-13  4:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Paul E. McKenney, Michel Lespinasse, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 06:43:25PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 12, 2011 at 09:08:13AM -0700, Paul E. McKenney wrote:
> > It should not be hard.  I already have an API for rcutorture testing
> > use, but it is not appropriate for your use because it is unsynchronized.
> 
> Sounds good.
> 
> > We need to be careful with what I give you and how you interpret it.
> > The most effective approach would be for me to give you an API that
> > filled in a cookie given a pointer to one, then another API that took
> > pointers to a pair of cookies and returned saying whether or not a
> > grace period had elapsed.  You would do something like the following:
> 
> Even a raw number of events is ok, but it will work like a cookie.
> 
> > 	rcu_get_gp_cookie(&pagep->rcucookie);
> > 	. . .
> > 
> > 	rcu_get_gp_cookie(&autovarcookie);
> > 	if (!rcu_cookie_gp_elapsed(&pagep->rcucookie, &autovarcookie))
> > 		synchronize_rcu();
> > 
> > So, how much space do I get for ->rcucookie?  By default, it is a pair
> > of unsigned longs, but I could live with as small as a single byte if
> > you didn't mind a high probability of false negatives (me telling you
> > to do a grace period despite 16 of them having happened in the meantime
> > due to overflow of a 4-bit field in the byte).
> 
> It could be 2 longs just fine (so it's 64bit even on 32bit, and it
> guarantees no false positives as it'll never overflow within the
> lifetime of the hardware); we've got tons of free space to use in
> page[1-511].* .
> 
> I'm currently unsure how the cookie can be allowed to be smaller than
> the real counter, though. I don't see how it is possible.
> 
> > That covers TREE_RCU and TREE_PREEMPT_RCU, on to TINY_RCU and TINY_PREEMPT_RCU.
> > 
> > TINY_RCU will require more thought, as it doesn't bother counting grace
> > periods.  Ah, but in TINY_RCU, synchronize_rcu() is free, so I simply
> > make rcu_cookie_gp_elapsed() always return false.
> 
> Yes, it'll surely be safe for us: on UP we have no race, and in fact
> get_page_unless_zero isn't even called in the speculative lookup on UP. With
> the code above you could always return true with TINY_RCU and skip the
> call.
> 
> > OK, TINY_PREEMPT_RCU...  It doesn't count grace periods, either.  But it
> > is able to reliably detect if there are any RCU readers in flight,
> > and there normally won't be, so synchronize_rcu() is again free in the
> > common case.  And no, I don't want to count grace periods as this would
> > increase the memory footprint.  And the whole point of TINY_PREEMPT_RCU
> > is to be tiny, after all.  ;-)
> 
> Ok so it returns always false, and synchronize_rcu is always called,
> but it will normally do nothing there.
> 
> > If you need SRCU, you are out of luck until I get my act together and
> > merge it in with the other RCU implementations, which might be awhile
> > still.
> 
> Good luck because we don't need SRCU, we just need a synchronize_rcu
> vs rcu_read_lock.
> 
> > For TREE_*RCU, the calls to rcu_get_gp_cookie() will cost you a lock
> > round trip.  I am hoping to be able to use the counters stored in the
> > rcu_data structure, which means that I would need to disable preemption
> > and re-enable it.  Or maybe disable and re-enable irqs instead, not yet
> > sure which.  This might require me to be conservative and make
> rcu_cookie_gp_elapsed() return false unless two grace periods have elapsed.  Things
> > get a bit tricky -- yes, I could just use the global counters, but that
> > would mean that rcu_get_gp_cookie() would need to acquire a global lock,
> > and I suspect that you intend to invoke it too often for that to be
> > a winning strategy.
> 
> It is invoked at every page allocation, there are some locks taken

For clarification: every page allocation? Really?
I guess you mean every THP page allocation.

-- 
Kind regards,
Minchan Kim


* Re: [RFC PATCH 0/3] page count lock for simpler put_page
  2011-08-13  1:57                 ` Paul E. McKenney
@ 2011-08-13 23:56                   ` Andrea Arcangeli
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-08-13 23:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michel Lespinasse, Minchan Kim, Hugh Dickins, linux-mm,
	Andrew Morton, Rik van Riel, Mel Gorman, Johannes Weiner,
	KOSAKI Motohiro, Shaohua Li

On Fri, Aug 12, 2011 at 06:57:41PM -0700, Paul E. McKenney wrote:
> But if we are getting much below 100 milliseconds, we need to rethink
> this.

The delay may be low in some corner cases, but this still benefits from
running it only once. You can mmap(), bzero(), then mremap(+4096) (if
mremap moves the pages to a non-2M-aligned address, it forces a
split_huge_page, a hardware issue), and all pages will be split in
potentially less than 100msec if there are only a few of them. At least
we'll be running synchronize_rcu only once instead of once per hugepage;
as long as it runs only once I guess we're ok. Normally it shouldn't
happen so fast. My current /proc/vmstat says there are 271 splits for
97251 THP allocated, and they're not likely to have all happened within
100msec.


end of thread, other threads:[~2011-08-13 23:56 UTC | newest]

Thread overview: 34+ messages
2011-08-04 21:07 [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
2011-08-04 21:07 ` [RFC PATCH 1/3] mm: Replace naked page->_count accesses with accessor functions Michel Lespinasse
2011-08-04 21:07 ` [RFC PATCH 2/3] mm: page count lock Michel Lespinasse
2011-08-07 14:00   ` Minchan Kim
2011-08-04 21:07 ` [RFC PATCH 3/3] mm: get_first_page_unless_zero() Michel Lespinasse
2011-08-07 14:13   ` Minchan Kim
2011-08-05  6:39 ` [RFC PATCH 0/3] page count lock for simpler put_page Michel Lespinasse
2011-08-07 14:25   ` Minchan Kim
2011-08-09 11:04     ` Michel Lespinasse
2011-08-09 22:22       ` Minchan Kim
2011-08-12 22:35         ` Michel Lespinasse
2011-08-13  4:07           ` Minchan Kim
2011-08-12 15:36       ` Andrea Arcangeli
2011-08-12 16:08         ` SPAM: " Paul E. McKenney
2011-08-12 16:43           ` Andrea Arcangeli
2011-08-12 17:27             ` Paul E. McKenney
2011-08-12 23:45               ` Michel Lespinasse
2011-08-13  1:57                 ` Paul E. McKenney
2011-08-13 23:56                   ` Andrea Arcangeli
2011-08-13  4:18             ` Minchan Kim
2011-08-12 16:57           ` Johannes Weiner
2011-08-12 17:08             ` Andrea Arcangeli
2011-08-12 17:52               ` Johannes Weiner
2011-08-12 18:13                 ` Paul E. McKenney
2011-08-12 19:05                   ` Johannes Weiner
2011-08-12 22:14                     ` Paul E. McKenney
2011-08-12 22:22                 ` Andrea Arcangeli
2011-08-12 18:03               ` Paul E. McKenney
2011-08-12 17:41             ` Paul E. McKenney
2011-08-12 17:56               ` Johannes Weiner
2011-08-12 23:02           ` Michel Lespinasse
2011-08-12 22:50         ` Michel Lespinasse
2011-08-13  4:11         ` Minchan Kim
2011-08-12 16:58   ` Andrea Arcangeli
