All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv3 0/5] Fix compound_head() race
@ 2015-08-19  9:21 ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

Here's my attempt on fixing recently discovered race in compound_head().
It should make compound_head() reliable in all contexts.

The patchset is against Linus' tree. Let me know if it need to be rebased
onto different baseline.

It's expected to have conflicts with my page-flags patchset and probably
should be applied before it.

v3:
   - Fix build without hugetlb;
   - Drop page->first_page;
   - Update comment for free_compound_page();
   - Use 'unsigned int' for page order;

v2: Per Hugh's suggestion page->compound_head is moved into third double
    word. This way we can avoid memory overhead which v1 had in some
    cases.

    This place in struct page is rather overloaded. More testing is
    required to make sure we don't collide with anyone.

Kirill A. Shutemov (5):
  mm: drop page->slab_page
  zsmalloc: use page->private instead of page->first_page
  mm: pack compound_dtor and compound_order into one word in struct page
  mm: make compound_head() robust
  mm: use 'unsigned int' for page order

 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/mm.h                     | 82 +++++++++++-----------------------
 include/linux/mm_types.h               | 21 ++++++---
 include/linux/page-flags.h             | 80 ++++++++-------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 ---
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           | 35 +++++++--------
 mm/internal.h                          |  8 ++--
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 76 ++++++++++++++++++-------------
 mm/swap.c                              |  4 +-
 mm/zsmalloc.c                          | 11 +++--
 14 files changed, 133 insertions(+), 216 deletions(-)

-- 
2.5.0


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCHv3 0/5] Fix compound_head() race
@ 2015-08-19  9:21 ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

Here's my attempt on fixing recently discovered race in compound_head().
It should make compound_head() reliable in all contexts.

The patchset is against Linus' tree. Let me know if it need to be rebased
onto different baseline.

It's expected to have conflicts with my page-flags patchset and probably
should be applied before it.

v3:
   - Fix build without hugetlb;
   - Drop page->first_page;
   - Update comment for free_compound_page();
   - Use 'unsigned int' for page order;

v2: Per Hugh's suggestion page->compound_head is moved into third double
    word. This way we can avoid memory overhead which v1 had in some
    cases.

    This place in struct page is rather overloaded. More testing is
    required to make sure we don't collide with anyone.

Kirill A. Shutemov (5):
  mm: drop page->slab_page
  zsmalloc: use page->private instead of page->first_page
  mm: pack compound_dtor and compound_order into one word in struct page
  mm: make compound_head() robust
  mm: use 'unsigned int' for page order

 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/mm.h                     | 82 +++++++++++-----------------------
 include/linux/mm_types.h               | 21 ++++++---
 include/linux/page-flags.h             | 80 ++++++++-------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 ---
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           | 35 +++++++--------
 mm/internal.h                          |  8 ++--
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 76 ++++++++++++++++++-------------
 mm/swap.c                              |  4 +-
 mm/zsmalloc.c                          | 11 +++--
 14 files changed, 133 insertions(+), 216 deletions(-)

-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCHv3 1/5] mm: drop page->slab_page
  2015-08-19  9:21 ` Kirill A. Shutemov
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov, Joonsoo Kim, Andi Kleen

Since 8456a648cf44 ("slab: use struct page for slab management") nobody
uses slab_page field in struct page.

Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <ak@linux.intel.com>
---
 include/linux/mm_types.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..58620ac7f15c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -140,7 +140,6 @@ struct page {
 #endif
 		};
 
-		struct slab *slab_page; /* slab fields */
 		struct rcu_head rcu_head;	/* Used by SLAB
 						 * when destroying via RCU
 						 */
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 1/5] mm: drop page->slab_page
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov, Joonsoo Kim, Andi Kleen

Since 8456a648cf44 ("slab: use struct page for slab management") nobody
uses slab_page field in struct page.

Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <ak@linux.intel.com>
---
 include/linux/mm_types.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..58620ac7f15c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -140,7 +140,6 @@ struct page {
 #endif
 		};
 
-		struct slab *slab_page; /* slab fields */
 		struct rcu_head rcu_head;	/* Used by SLAB
 						 * when destroying via RCU
 						 */
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 2/5] zsmalloc: use page->private instead of page->first_page
  2015-08-19  9:21 ` Kirill A. Shutemov
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

We are going to rework how compound_head() work. It will not use
page->first_page as we have it now.

The only other user of page->fisrt_page beyond compound pages is
zsmalloc.

Let's use page->private instead of page->first_page here. It occupies
the same storage space.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/zsmalloc.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0a7f81aa2249..a85754e69879 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -16,7 +16,7 @@
  * struct page(s) to form a zspage.
  *
  * Usage of struct page fields:
- *	page->first_page: points to the first component (0-order) page
+ *	page->private: points to the first component (0-order) page
  *	page->index (union with page->freelist): offset of the first object
  *		starting in this page. For the first page, this is
  *		always 0, so we use this field (aka freelist) to point
@@ -26,8 +26,7 @@
  *
  *	For _first_ page only:
  *
- *	page->private (union with page->first_page): refers to the
- *		component page after the first page
+ *	page->private: refers to the component page after the first page
  *		If the page is first_page for huge object, it stores handle.
  *		Look at size_class->huge.
  *	page->freelist: points to the first free object in zspage.
@@ -770,7 +769,7 @@ static struct page *get_first_page(struct page *page)
 	if (is_first_page(page))
 		return page;
 	else
-		return page->first_page;
+		return (struct page *)page_private(page);
 }
 
 static struct page *get_next_page(struct page *page)
@@ -955,7 +954,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	 * Allocate individual pages and link them together as:
 	 * 1. first page->private = first sub-page
 	 * 2. all sub-pages are linked together using page->lru
-	 * 3. each sub-page is linked to the first page using page->first_page
+	 * 3. each sub-page is linked to the first page using page->private
 	 *
 	 * For each size class, First/Head pages are linked together using
 	 * page->lru. Also, we set PG_private to identify the first page
@@ -980,7 +979,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
 		if (i >= 1)
-			page->first_page = first_page;
+			set_page_private(first_page, (unsigned long)first_page);
 		if (i >= 2)
 			list_add(&page->lru, &prev_page->lru);
 		if (i == class->pages_per_zspage - 1)	/* last page */
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 2/5] zsmalloc: use page->private instead of page->first_page
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

We are going to rework how compound_head() work. It will not use
page->first_page as we have it now.

The only other user of page->fisrt_page beyond compound pages is
zsmalloc.

Let's use page->private instead of page->first_page here. It occupies
the same storage space.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/zsmalloc.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0a7f81aa2249..a85754e69879 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -16,7 +16,7 @@
  * struct page(s) to form a zspage.
  *
  * Usage of struct page fields:
- *	page->first_page: points to the first component (0-order) page
+ *	page->private: points to the first component (0-order) page
  *	page->index (union with page->freelist): offset of the first object
  *		starting in this page. For the first page, this is
  *		always 0, so we use this field (aka freelist) to point
@@ -26,8 +26,7 @@
  *
  *	For _first_ page only:
  *
- *	page->private (union with page->first_page): refers to the
- *		component page after the first page
+ *	page->private: refers to the component page after the first page
  *		If the page is first_page for huge object, it stores handle.
  *		Look at size_class->huge.
  *	page->freelist: points to the first free object in zspage.
@@ -770,7 +769,7 @@ static struct page *get_first_page(struct page *page)
 	if (is_first_page(page))
 		return page;
 	else
-		return page->first_page;
+		return (struct page *)page_private(page);
 }
 
 static struct page *get_next_page(struct page *page)
@@ -955,7 +954,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	 * Allocate individual pages and link them together as:
 	 * 1. first page->private = first sub-page
 	 * 2. all sub-pages are linked together using page->lru
-	 * 3. each sub-page is linked to the first page using page->first_page
+	 * 3. each sub-page is linked to the first page using page->private
 	 *
 	 * For each size class, First/Head pages are linked together using
 	 * page->lru. Also, we set PG_private to identify the first page
@@ -980,7 +979,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
 		if (i >= 1)
-			page->first_page = first_page;
+			set_page_private(first_page, (unsigned long)first_page);
 		if (i >= 2)
 			list_add(&page->lru, &prev_page->lru);
 		if (i == class->pages_per_zspage - 1)	/* last page */
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
  2015-08-19  9:21 ` Kirill A. Shutemov
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

The patch halves space occupied by compound_dtor and compound_order in
struct page.

For compound_order, it's trivial long -> int/short conversion.

For get_compound_page_dtor(), we now use hardcoded table for destructor
lookup and store its index in the struct page instead of direct pointer
to destructor. It shouldn't be a big trouble to maintain the table: we
have only two destructor and NULL currently.

This patch free up one word in tail pages for reuse. This is preparation
for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h       | 24 +++++++++++++++++++-----
 include/linux/mm_types.h | 11 +++++++----
 mm/hugetlb.c             |  8 ++++----
 mm/page_alloc.c          | 11 ++++++++++-
 4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f92dbac..0735bc0a351a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -575,18 +575,32 @@ int split_free_page(struct page *page);
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
- * These are _only_ valid on the head of a PG_compound page.
+ * These are _only_ valid on the head of a compound page.
  */
+typedef void compound_page_dtor(struct page *);
+
+/* Keep the enum in sync with compound_page_dtors array in mm/page_alloc.c */
+enum {
+	NULL_COMPOUND_DTOR,
+	COMPOUND_PAGE_DTOR,
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_PAGE_DTOR,
+#endif
+	NR_COMPOUND_DTORS,
+};
+extern compound_page_dtor * const compound_page_dtors[];
 
 static inline void set_compound_page_dtor(struct page *page,
-						compound_page_dtor *dtor)
+		unsigned int compound_dtor)
 {
-	page[1].compound_dtor = dtor;
+	VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
+	page[1].compound_dtor = compound_dtor;
 }
 
 static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 {
-	return page[1].compound_dtor;
+	VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
+	return compound_page_dtors[page[1].compound_dtor];
 }
 
 static inline int compound_order(struct page *page)
@@ -596,7 +610,7 @@ static inline int compound_order(struct page *page)
 	return page[1].compound_order;
 }
 
-static inline void set_compound_order(struct page *page, unsigned long order)
+static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 58620ac7f15c..63cdfe7ec336 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -28,8 +28,6 @@ struct mem_cgroup;
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
 #define ALLOC_SPLIT_PTLOCKS	(SPINLOCK_SIZE > BITS_PER_LONG/8)
 
-typedef void compound_page_dtor(struct page *);
-
 /*
  * Each physical page in the system has a struct page associated with
  * it to keep track of whatever it is we are using the page for at the
@@ -145,8 +143,13 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
-			compound_page_dtor *compound_dtor;
-			unsigned long compound_order;
+#ifdef CONFIG_64BIT
+			unsigned int compound_dtor;
+			unsigned int compound_order;
+#else
+			unsigned short int compound_dtor;
+			unsigned short int compound_order;
+#endif
 		};
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..8ea74caa1fa8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -969,7 +969,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_writeback);
 	}
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	set_compound_page_dtor(page, NULL);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
@@ -1065,7 +1065,7 @@ void free_huge_page(struct page *page)
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
-	set_compound_page_dtor(page, free_huge_page);
+	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	spin_lock(&hugetlb_lock);
 	set_hugetlb_cgroup(page, NULL);
 	h->nr_huge_pages++;
@@ -1117,7 +1117,7 @@ int PageHuge(struct page *page)
 		return 0;
 
 	page = compound_head(page);
-	return get_compound_page_dtor(page) == free_huge_page;
+	return page[1].compound_dtor == HUGETLB_PAGE_DTOR;
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
@@ -1314,7 +1314,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 	if (page) {
 		INIT_LIST_HEAD(&page->lru);
 		r_nid = page_to_nid(page);
-		set_compound_page_dtor(page, free_huge_page);
+		set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 		set_hugetlb_cgroup(page, NULL);
 		/*
 		 * We incremented the global counters already
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df959b7d6085..c6733cc3cbce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,15 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };
 
+static void free_compound_page(struct page *page);
+compound_page_dtor * const compound_page_dtors[] = {
+	NULL,
+	free_compound_page,
+#ifdef CONFIG_HUGETLB_PAGE
+	free_huge_page,
+#endif
+};
+
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 
@@ -437,7 +446,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	int i;
 	int nr_pages = 1 << order;
 
-	set_compound_page_dtor(page, free_compound_page);
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
 	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

The patch halves space occupied by compound_dtor and compound_order in
struct page.

For compound_order, it's trivial long -> int/short conversion.

For get_compound_page_dtor(), we now use hardcoded table for destructor
lookup and store its index in the struct page instead of direct pointer
to destructor. It shouldn't be a big trouble to maintain the table: we
have only two destructor and NULL currently.

This patch free up one word in tail pages for reuse. This is preparation
for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h       | 24 +++++++++++++++++++-----
 include/linux/mm_types.h | 11 +++++++----
 mm/hugetlb.c             |  8 ++++----
 mm/page_alloc.c          | 11 ++++++++++-
 4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f92dbac..0735bc0a351a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -575,18 +575,32 @@ int split_free_page(struct page *page);
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
- * These are _only_ valid on the head of a PG_compound page.
+ * These are _only_ valid on the head of a compound page.
  */
+typedef void compound_page_dtor(struct page *);
+
+/* Keep the enum in sync with compound_page_dtors array in mm/page_alloc.c */
+enum {
+	NULL_COMPOUND_DTOR,
+	COMPOUND_PAGE_DTOR,
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_PAGE_DTOR,
+#endif
+	NR_COMPOUND_DTORS,
+};
+extern compound_page_dtor * const compound_page_dtors[];
 
 static inline void set_compound_page_dtor(struct page *page,
-						compound_page_dtor *dtor)
+		unsigned int compound_dtor)
 {
-	page[1].compound_dtor = dtor;
+	VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
+	page[1].compound_dtor = compound_dtor;
 }
 
 static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 {
-	return page[1].compound_dtor;
+	VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
+	return compound_page_dtors[page[1].compound_dtor];
 }
 
 static inline int compound_order(struct page *page)
@@ -596,7 +610,7 @@ static inline int compound_order(struct page *page)
 	return page[1].compound_order;
 }
 
-static inline void set_compound_order(struct page *page, unsigned long order)
+static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 58620ac7f15c..63cdfe7ec336 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -28,8 +28,6 @@ struct mem_cgroup;
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
 #define ALLOC_SPLIT_PTLOCKS	(SPINLOCK_SIZE > BITS_PER_LONG/8)
 
-typedef void compound_page_dtor(struct page *);
-
 /*
  * Each physical page in the system has a struct page associated with
  * it to keep track of whatever it is we are using the page for at the
@@ -145,8 +143,13 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
-			compound_page_dtor *compound_dtor;
-			unsigned long compound_order;
+#ifdef CONFIG_64BIT
+			unsigned int compound_dtor;
+			unsigned int compound_order;
+#else
+			unsigned short int compound_dtor;
+			unsigned short int compound_order;
+#endif
 		};
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..8ea74caa1fa8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -969,7 +969,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_writeback);
 	}
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	set_compound_page_dtor(page, NULL);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
@@ -1065,7 +1065,7 @@ void free_huge_page(struct page *page)
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
-	set_compound_page_dtor(page, free_huge_page);
+	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	spin_lock(&hugetlb_lock);
 	set_hugetlb_cgroup(page, NULL);
 	h->nr_huge_pages++;
@@ -1117,7 +1117,7 @@ int PageHuge(struct page *page)
 		return 0;
 
 	page = compound_head(page);
-	return get_compound_page_dtor(page) == free_huge_page;
+	return page[1].compound_dtor == HUGETLB_PAGE_DTOR;
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
@@ -1314,7 +1314,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 	if (page) {
 		INIT_LIST_HEAD(&page->lru);
 		r_nid = page_to_nid(page);
-		set_compound_page_dtor(page, free_huge_page);
+		set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 		set_hugetlb_cgroup(page, NULL);
 		/*
 		 * We incremented the global counters already
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df959b7d6085..c6733cc3cbce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,15 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };
 
+static void free_compound_page(struct page *page);
+compound_page_dtor * const compound_page_dtors[] = {
+	NULL,
+	free_compound_page,
+#ifdef CONFIG_HUGETLB_PAGE
+	free_huge_page,
+#endif
+};
+
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 
@@ -437,7 +446,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	int i;
 	int nr_pages = 1 << order;
 
-	set_compound_page_dtor(page, free_compound_page);
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
 	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-19  9:21 ` Kirill A. Shutemov
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

	CPU0					CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
					put_page()
					  tail->first_page = NULL
      head = tail->first_page
					alloc_pages(__GFP_COMP)
					   prep_compound_page()
					     tail->first_page = head
					     __SetPageTail(p);
      !!PageTail() == true
    <head == NULL dereferencing>

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. That means it shares storage
space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;
 - page->pmd_huge_pte;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word. It can be used to encode PageTail(). And if the bit
set, rest of the word is pointer to head page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
---
 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/mm.h                     | 53 ++--------------------
 include/linux/mm_types.h               |  9 +++-
 include/linux/page-flags.h             | 80 ++++++++--------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 ---
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           |  8 +---
 mm/internal.h                          |  4 +-
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 38 ++++++++--------
 mm/swap.c                              |  4 +-
 13 files changed, 58 insertions(+), 170 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
index 6dea4fd5c961..62842a857dab 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
 which must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
-allocation: slab uses page->slab_cache and page->first_page for its pages.
-These fields share storage with page->ptl.
+allocation: slab uses page->slab_cache for its pages.
+This field shares storage with page->ptl.
 
 PMD split lock only makes sense if you have more than two page table
 levels.
diff --git a/arch/xtensa/configs/iss_defconfig b/arch/xtensa/configs/iss_defconfig
index e4d193e7a300..5c7c385f21c4 100644
--- a/arch/xtensa/configs/iss_defconfig
+++ b/arch/xtensa/configs/iss_defconfig
@@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
 # CONFIG_SPARSEMEM_MANUAL is not set
 CONFIG_FLATMEM=y
 CONFIG_FLAT_NODE_MEM_MAP=y
-CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 # CONFIG_PHYS_ADDR_T_64BIT is not set
 CONFIG_ZONE_DMA_FLAG=1
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0735bc0a351a..a4c4b7d07473 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct page *page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-	struct page *head = tail->first_page;
-
-	/*
-	 * page->first_page may be a dangling pointer to an old
-	 * compound page, so recheck that it is still a tail
-	 * page before returning.
-	 */
-	smp_rmb();
-	if (likely(PageTail(tail)))
-		return head;
-	return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
-static inline struct page *compound_head(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return compound_head_by_tail(page);
-	return page;
-}
-
-/*
- * If we access compound page synchronously such as access to
- * allocated page, there is no need to handle tail flag race, so we can
- * check tail flag directly without any synchronization primitive.
- */
-static inline struct page *compound_head_fast(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return page->first_page;
-	return page;
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -525,7 +485,7 @@ static inline void get_huge_page_tail(struct page *page)
 	VM_BUG_ON_PAGE(!PageTail(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
+	if (compound_tail_refcounted(compound_head(page)))
 		atomic_inc(&page->_mapcount);
 }
 
@@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
 
-	/*
-	 * We don't need to worry about synchronization of tail flag
-	 * when we call virt_to_head_page() since it is only called for
-	 * already allocated page and this page won't be freed until
-	 * this virt_to_head_page() is finished. So use _fast variant.
-	 */
-	return compound_head_fast(page);
+	return compound_head(page);
 }
 
 /*
@@ -1496,8 +1450,7 @@ static inline bool ptlock_init(struct page *page)
 	 * with 0. Make sure nobody took it in use in between.
 	 *
 	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache and page->first_page (for tail
-	 * pages), which share storage with page->ptl.
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
 	 */
 	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
 	if (!ptlock_alloc(page))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 63cdfe7ec336..e324768b6cc7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,12 @@ struct page {
 		};
 	};
 
-	/* Third double word block */
+	/*
+	 * Third double word block
+	 *
+	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
+	 * for non-tail pages.
+	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
@@ -143,6 +148,7 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
+			unsigned long compound_head; /* If bit zero is set */
 #ifdef CONFIG_64BIT
 			unsigned int compound_dtor;
 			unsigned int compound_order;
@@ -174,7 +180,6 @@ struct page {
 #endif
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
-		struct page *first_page;	/* Compound tail pages */
 	};
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 41c93844fb1d..9b865158e452 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -86,12 +86,7 @@ enum pageflags {
 	PG_private,		/* If pagecache, has fs-private data */
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	PG_head,		/* A head page */
-	PG_tail,		/* A tail page */
-#else
-	PG_compound,		/* A compound page */
-#endif
 	PG_swapcache,		/* Swap page: swp_entry_t in private */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
@@ -387,85 +382,46 @@ static inline void set_page_writeback_keepwrite(struct page *page)
 	test_set_page_writeback_keepwrite(page);
 }
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
-/*
- * System with lots of page flags available. This allows separate
- * flags for PageHead() and PageTail() checks of compound pages so that bit
- * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths except arch/powerpc/mm/init_64.c
- * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
- * and avoid handling those in real mode.
- */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
-__PAGEFLAG(Tail, tail)
 
-static inline int PageCompound(struct page *page)
-{
-	return page->flags & ((1L << PG_head) | (1L << PG_tail));
-
-}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline void ClearPageCompound(struct page *page)
+static inline int PageTail(struct page *page)
 {
-	BUG_ON(!PageHead(page));
-	ClearPageHead(page);
+	return READ_ONCE(page->compound_head) & 1;
 }
-#endif
-
-#define PG_head_mask ((1L << PG_head))
 
-#else
-/*
- * Reduce page flag use as much as possible by overlapping
- * compound page flags with the flags used for page cache pages. Possible
- * because PageCompound is always set for compound pages and not for
- * pages on the LRU and/or pagecache.
- */
-TESTPAGEFLAG(Compound, compound)
-__SETPAGEFLAG(Head, compound)  __CLEARPAGEFLAG(Head, compound)
-
-/*
- * PG_reclaim is used in combination with PG_compound to mark the
- * head and tail of a compound page. This saves one page flag
- * but makes it impossible to use compound pages for the page cache.
- * The PG_reclaim bit would have to be used for reclaim or readahead
- * if compound pages enter the page cache.
- *
- * PG_compound & PG_reclaim	=> Tail page
- * PG_compound & ~PG_reclaim	=> Head page
- */
-#define PG_head_mask ((1L << PG_compound))
-#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
-
-static inline int PageHead(struct page *page)
+static inline void set_compound_head(struct page *page, struct page *head)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_mask);
+	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
 }
 
-static inline int PageTail(struct page *page)
+static inline void clear_compound_head(struct page *page)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
+	WRITE_ONCE(page->compound_head, 0);
 }
 
-static inline void __SetPageTail(struct page *page)
+static inline struct page *compound_head(struct page *page)
 {
-	page->flags |= PG_head_tail_mask;
+	unsigned long head = READ_ONCE(page->compound_head);
+
+	if (unlikely(head & 1))
+		return (struct page *) (head - 1);
+	return page;
 }
 
-static inline void __ClearPageTail(struct page *page)
+static inline int PageCompound(struct page *page)
 {
-	page->flags &= ~PG_head_tail_mask;
-}
+	return PageHead(page) || PageTail(page);
 
+}
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void ClearPageCompound(struct page *page)
 {
-	BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
-	clear_bit(PG_compound, &page->flags);
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
 }
 #endif
 
-#endif /* !PAGEFLAGS_EXTENDED */
+#define PG_head_mask ((1L << PG_head))
 
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..454579d31081 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,18 +200,6 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
-#
-# If we have space for more page flags then we can enable additional
-# optimizations and functionality.
-#
-# Regular Sparsemem takes page flag bits for the sectionid if it does not
-# use a virtual memmap. Disable extended page flags for 32 bit platforms
-# that require the use of a sectionid in the page flags.
-#
-config PAGEFLAGS_EXTENDED
-	def_bool y
-	depends on 64BIT || SPARSEMEM_VMEMMAP || !SPARSEMEM
-
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..205e5ef957ab 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,12 +25,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private,		"private"	},
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{1UL << PG_head,		"head"		},
-	{1UL << PG_tail,		"tail"		},
-#else
-	{1UL << PG_compound,		"compound"	},
-#endif
 	{1UL << PG_swapcache,		"swapcache"	},
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},
 	{1UL << PG_reclaim,		"reclaim"	},
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4bfbd9..330377f83ac7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
 				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/* clear PageTail before overwriting first_page */
-		smp_wmb();
+		clear_compound_head(page_tail);
 
 		/*
 		 * __split_huge_page_splitting() already set the
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ea74caa1fa8..53c0709fd87b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -824,9 +824,8 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__ClearPageTail(p);
+		clear_compound_head(p);
 		set_page_refcounted(p);
-		p->first_page = NULL;
 	}
 
 	set_compound_order(page, 0);
@@ -1099,10 +1098,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 		 */
 		__ClearPageReserved(p);
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..89e21a07080a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * speculative page access (like in
 	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
+	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
 	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
+		atomic_inc(&compound_head(page)->_count);
 	get_huge_page_tail(page);
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1f4446a90cef..4d1a5de9653d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define lru		(1UL << PG_lru)
 #define swapbacked	(1UL << PG_swapbacked)
 #define head		(1UL << PG_head)
-#define tail		(1UL << PG_tail)
-#define compound	(1UL << PG_compound)
 #define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
@@ -811,12 +809,7 @@ static struct page_state {
 	 */
 	{ slab,		slab,		MF_MSG_SLAB,	me_kernel },
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		MF_MSG_HUGE,		me_huge_page },
-	{ tail,		tail,		MF_MSG_HUGE,		me_huge_page },
-#else
-	{ compound,	compound,	MF_MSG_HUGE,		me_huge_page },
-#endif
 
 	{ sc|dirty,	sc|dirty,	MF_MSG_DIRTY_SWAPCACHE,	me_swapcache_dirty },
 	{ sc|dirty,	sc,		MF_MSG_CLEAN_SWAPCACHE,	me_swapcache_clean },
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6733cc3cbce..78859d47aaf4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -424,15 +424,15 @@ out:
 /*
  * Higher-order pages are called "compound pages".  They are structured thusly:
  *
- * The first PAGE_SIZE page is called the "head page".
+ * The first PAGE_SIZE page is called the "head page" and have PG_head set.
  *
- * The remaining PAGE_SIZE pages are called "tail pages".
+ * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
+ * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
  *
- * All pages have PG_compound set.  All tail pages have their ->first_page
- * pointing at the head page.
+ * The first tail page's ->compound_dtor holds the offset in array of compound
+ * page destructors. See compound_page_dtors.
  *
- * The first tail page's ->lru.next holds the address of the compound page's
- * put_page() function.  Its ->lru.prev holds the order of allocation.
+ * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
  */
 
@@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
@@ -830,17 +827,24 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (!IS_ENABLED(CONFIG_DEBUG_VM))
-		return 0;
+	int ret = 1;
+
+	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
+		ret = 0;
+		goto out;
+	}
 	if (unlikely(!PageTail(page))) {
 		bad_page(page, "PageTail not set", 0);
-		return 1;
+		goto out;
 	}
-	if (unlikely(page->first_page != head_page)) {
-		bad_page(page, "first_page not consistent", 0);
-		return 1;
+	if (unlikely(compound_head(page) != head_page)) {
+		bad_page(page, "compound_head not consistent", 0);
+		goto out;
 	}
-	return 0;
+	ret = 0;
+out:
+	clear_compound_head(page);
+	return ret;
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
diff --git a/mm/swap.c b/mm/swap.c
index a3a0a2f1f7c3..faa9e1687dea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,7 +200,7 @@ out_put_single:
 				__put_single_page(page);
 			return;
 		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
+		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
 		/*
 		 * We can release the refcount taken by
 		 * get_page_unless_zero() now that
@@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
 	 *  Case 3 is possible, as we may race with
 	 *  __split_huge_page_refcount tearing down a THP page.
 	 */
-	page_head = compound_head_by_tail(page);
+	page_head = compound_head(page);
 	if (!__compound_tail_refcounted(page_head))
 		put_unrefcounted_compound_page(page_head, page);
 	else
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

	CPU0					CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
					put_page()
					  tail->first_page = NULL
      head = tail->first_page
					alloc_pages(__GFP_COMP)
					   prep_compound_page()
					     tail->first_page = head
					     __SetPageTail(p);
      !!PageTail() == true
    <head == NULL dereferencing>

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. That means it shares storage
space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;
 - page->pmd_huge_pte;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word. It can be used to encode PageTail(). And if the bit
set, rest of the word is pointer to head page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
---
 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/mm.h                     | 53 ++--------------------
 include/linux/mm_types.h               |  9 +++-
 include/linux/page-flags.h             | 80 ++++++++--------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 ---
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           |  8 +---
 mm/internal.h                          |  4 +-
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 38 ++++++++--------
 mm/swap.c                              |  4 +-
 13 files changed, 58 insertions(+), 170 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
index 6dea4fd5c961..62842a857dab 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
 which must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
-allocation: slab uses page->slab_cache and page->first_page for its pages.
-These fields share storage with page->ptl.
+allocation: slab uses page->slab_cache for its pages.
+This field shares storage with page->ptl.
 
 PMD split lock only makes sense if you have more than two page table
 levels.
diff --git a/arch/xtensa/configs/iss_defconfig b/arch/xtensa/configs/iss_defconfig
index e4d193e7a300..5c7c385f21c4 100644
--- a/arch/xtensa/configs/iss_defconfig
+++ b/arch/xtensa/configs/iss_defconfig
@@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
 # CONFIG_SPARSEMEM_MANUAL is not set
 CONFIG_FLATMEM=y
 CONFIG_FLAT_NODE_MEM_MAP=y
-CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 # CONFIG_PHYS_ADDR_T_64BIT is not set
 CONFIG_ZONE_DMA_FLAG=1
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0735bc0a351a..a4c4b7d07473 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct page *page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-	struct page *head = tail->first_page;
-
-	/*
-	 * page->first_page may be a dangling pointer to an old
-	 * compound page, so recheck that it is still a tail
-	 * page before returning.
-	 */
-	smp_rmb();
-	if (likely(PageTail(tail)))
-		return head;
-	return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
-static inline struct page *compound_head(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return compound_head_by_tail(page);
-	return page;
-}
-
-/*
- * If we access compound page synchronously such as access to
- * allocated page, there is no need to handle tail flag race, so we can
- * check tail flag directly without any synchronization primitive.
- */
-static inline struct page *compound_head_fast(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return page->first_page;
-	return page;
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -525,7 +485,7 @@ static inline void get_huge_page_tail(struct page *page)
 	VM_BUG_ON_PAGE(!PageTail(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
+	if (compound_tail_refcounted(compound_head(page)))
 		atomic_inc(&page->_mapcount);
 }
 
@@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
 
-	/*
-	 * We don't need to worry about synchronization of tail flag
-	 * when we call virt_to_head_page() since it is only called for
-	 * already allocated page and this page won't be freed until
-	 * this virt_to_head_page() is finished. So use _fast variant.
-	 */
-	return compound_head_fast(page);
+	return compound_head(page);
 }
 
 /*
@@ -1496,8 +1450,7 @@ static inline bool ptlock_init(struct page *page)
 	 * with 0. Make sure nobody took it in use in between.
 	 *
 	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache and page->first_page (for tail
-	 * pages), which share storage with page->ptl.
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
 	 */
 	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
 	if (!ptlock_alloc(page))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 63cdfe7ec336..e324768b6cc7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,12 @@ struct page {
 		};
 	};
 
-	/* Third double word block */
+	/*
+	 * Third double word block
+	 *
+	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
+	 * for non-tail pages.
+	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
@@ -143,6 +148,7 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
+			unsigned long compound_head; /* If bit zero is set */
 #ifdef CONFIG_64BIT
 			unsigned int compound_dtor;
 			unsigned int compound_order;
@@ -174,7 +180,6 @@ struct page {
 #endif
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
-		struct page *first_page;	/* Compound tail pages */
 	};
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 41c93844fb1d..9b865158e452 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -86,12 +86,7 @@ enum pageflags {
 	PG_private,		/* If pagecache, has fs-private data */
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	PG_head,		/* A head page */
-	PG_tail,		/* A tail page */
-#else
-	PG_compound,		/* A compound page */
-#endif
 	PG_swapcache,		/* Swap page: swp_entry_t in private */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
@@ -387,85 +382,46 @@ static inline void set_page_writeback_keepwrite(struct page *page)
 	test_set_page_writeback_keepwrite(page);
 }
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
-/*
- * System with lots of page flags available. This allows separate
- * flags for PageHead() and PageTail() checks of compound pages so that bit
- * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths except arch/powerpc/mm/init_64.c
- * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
- * and avoid handling those in real mode.
- */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
-__PAGEFLAG(Tail, tail)
 
-static inline int PageCompound(struct page *page)
-{
-	return page->flags & ((1L << PG_head) | (1L << PG_tail));
-
-}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline void ClearPageCompound(struct page *page)
+static inline int PageTail(struct page *page)
 {
-	BUG_ON(!PageHead(page));
-	ClearPageHead(page);
+	return READ_ONCE(page->compound_head) & 1;
 }
-#endif
-
-#define PG_head_mask ((1L << PG_head))
 
-#else
-/*
- * Reduce page flag use as much as possible by overlapping
- * compound page flags with the flags used for page cache pages. Possible
- * because PageCompound is always set for compound pages and not for
- * pages on the LRU and/or pagecache.
- */
-TESTPAGEFLAG(Compound, compound)
-__SETPAGEFLAG(Head, compound)  __CLEARPAGEFLAG(Head, compound)
-
-/*
- * PG_reclaim is used in combination with PG_compound to mark the
- * head and tail of a compound page. This saves one page flag
- * but makes it impossible to use compound pages for the page cache.
- * The PG_reclaim bit would have to be used for reclaim or readahead
- * if compound pages enter the page cache.
- *
- * PG_compound & PG_reclaim	=> Tail page
- * PG_compound & ~PG_reclaim	=> Head page
- */
-#define PG_head_mask ((1L << PG_compound))
-#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
-
-static inline int PageHead(struct page *page)
+static inline void set_compound_head(struct page *page, struct page *head)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_mask);
+	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
 }
 
-static inline int PageTail(struct page *page)
+static inline void clear_compound_head(struct page *page)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
+	WRITE_ONCE(page->compound_head, 0);
 }
 
-static inline void __SetPageTail(struct page *page)
+static inline struct page *compound_head(struct page *page)
 {
-	page->flags |= PG_head_tail_mask;
+	unsigned long head = READ_ONCE(page->compound_head);
+
+	if (unlikely(head & 1))
+		return (struct page *) (head - 1);
+	return page;
 }
 
-static inline void __ClearPageTail(struct page *page)
+static inline int PageCompound(struct page *page)
 {
-	page->flags &= ~PG_head_tail_mask;
-}
+	return PageHead(page) || PageTail(page);
 
+}
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void ClearPageCompound(struct page *page)
 {
-	BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
-	clear_bit(PG_compound, &page->flags);
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
 }
 #endif
 
-#endif /* !PAGEFLAGS_EXTENDED */
+#define PG_head_mask ((1L << PG_head))
 
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..454579d31081 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,18 +200,6 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
-#
-# If we have space for more page flags then we can enable additional
-# optimizations and functionality.
-#
-# Regular Sparsemem takes page flag bits for the sectionid if it does not
-# use a virtual memmap. Disable extended page flags for 32 bit platforms
-# that require the use of a sectionid in the page flags.
-#
-config PAGEFLAGS_EXTENDED
-	def_bool y
-	depends on 64BIT || SPARSEMEM_VMEMMAP || !SPARSEMEM
-
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..205e5ef957ab 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,12 +25,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private,		"private"	},
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{1UL << PG_head,		"head"		},
-	{1UL << PG_tail,		"tail"		},
-#else
-	{1UL << PG_compound,		"compound"	},
-#endif
 	{1UL << PG_swapcache,		"swapcache"	},
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},
 	{1UL << PG_reclaim,		"reclaim"	},
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4bfbd9..330377f83ac7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
 				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/* clear PageTail before overwriting first_page */
-		smp_wmb();
+		clear_compound_head(page_tail);
 
 		/*
 		 * __split_huge_page_splitting() already set the
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ea74caa1fa8..53c0709fd87b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -824,9 +824,8 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__ClearPageTail(p);
+		clear_compound_head(p);
 		set_page_refcounted(p);
-		p->first_page = NULL;
 	}
 
 	set_compound_order(page, 0);
@@ -1099,10 +1098,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 		 */
 		__ClearPageReserved(p);
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..89e21a07080a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * speculative page access (like in
 	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
+	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
 	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
+		atomic_inc(&compound_head(page)->_count);
 	get_huge_page_tail(page);
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1f4446a90cef..4d1a5de9653d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define lru		(1UL << PG_lru)
 #define swapbacked	(1UL << PG_swapbacked)
 #define head		(1UL << PG_head)
-#define tail		(1UL << PG_tail)
-#define compound	(1UL << PG_compound)
 #define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
@@ -811,12 +809,7 @@ static struct page_state {
 	 */
 	{ slab,		slab,		MF_MSG_SLAB,	me_kernel },
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		MF_MSG_HUGE,		me_huge_page },
-	{ tail,		tail,		MF_MSG_HUGE,		me_huge_page },
-#else
-	{ compound,	compound,	MF_MSG_HUGE,		me_huge_page },
-#endif
 
 	{ sc|dirty,	sc|dirty,	MF_MSG_DIRTY_SWAPCACHE,	me_swapcache_dirty },
 	{ sc|dirty,	sc,		MF_MSG_CLEAN_SWAPCACHE,	me_swapcache_clean },
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6733cc3cbce..78859d47aaf4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -424,15 +424,15 @@ out:
 /*
  * Higher-order pages are called "compound pages".  They are structured thusly:
  *
- * The first PAGE_SIZE page is called the "head page".
+ * The first PAGE_SIZE page is called the "head page" and have PG_head set.
  *
- * The remaining PAGE_SIZE pages are called "tail pages".
+ * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
+ * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
  *
- * All pages have PG_compound set.  All tail pages have their ->first_page
- * pointing at the head page.
+ * The first tail page's ->compound_dtor holds the offset in array of compound
+ * page destructors. See compound_page_dtors.
  *
- * The first tail page's ->lru.next holds the address of the compound page's
- * put_page() function.  Its ->lru.prev holds the order of allocation.
+ * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
  */
 
@@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
@@ -830,17 +827,24 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (!IS_ENABLED(CONFIG_DEBUG_VM))
-		return 0;
+	int ret = 1;
+
+	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
+		ret = 0;
+		goto out;
+	}
 	if (unlikely(!PageTail(page))) {
 		bad_page(page, "PageTail not set", 0);
-		return 1;
+		goto out;
 	}
-	if (unlikely(page->first_page != head_page)) {
-		bad_page(page, "first_page not consistent", 0);
-		return 1;
+	if (unlikely(compound_head(page) != head_page)) {
+		bad_page(page, "compound_head not consistent", 0);
+		goto out;
 	}
-	return 0;
+	ret = 0;
+out:
+	clear_compound_head(page);
+	return ret;
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
diff --git a/mm/swap.c b/mm/swap.c
index a3a0a2f1f7c3..faa9e1687dea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,7 +200,7 @@ out_put_single:
 				__put_single_page(page);
 			return;
 		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
+		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
 		/*
 		 * We can release the refcount taken by
 		 * get_page_unless_zero() now that
@@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
 	 *  Case 3 is possible, as we may race with
 	 *  __split_huge_page_refcount tearing down a THP page.
 	 */
-	page_head = compound_head_by_tail(page);
+	page_head = compound_head(page);
 	if (!__compound_tail_refcounted(page_head))
 		put_unrefcounted_compound_page(page_head, page);
 	else
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 5/5] mm: use 'unsigned int' for page order
  2015-08-19  9:21 ` Kirill A. Shutemov
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

Let's try to be consistent about data type of page order.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |  5 +++--
 mm/hugetlb.c       | 19 ++++++++++---------
 mm/internal.h      |  4 ++--
 mm/page_alloc.c    | 27 +++++++++++++++------------
 4 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4c4b7d07473..a75bbb3f7142 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -557,7 +557,7 @@ static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 	return compound_page_dtors[page[1].compound_dtor];
 }
 
-static inline int compound_order(struct page *page)
+static inline unsigned int compound_order(struct page *page)
 {
 	if (!PageHead(page))
 		return 0;
@@ -1718,7 +1718,8 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
 extern __printf(3, 4)
-void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
+void warn_alloc_failed(gfp_t gfp_mask, unsigned int order,
+		const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 53c0709fd87b..bf64bfebc473 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -817,7 +817,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 
 #if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
 static void destroy_compound_gigantic_page(struct page *page,
-					unsigned long order)
+					unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -832,7 +832,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned order)
+static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	free_contig_range(page_to_pfn(page), 1 << order);
 }
@@ -876,7 +876,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
 	return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(int nid, unsigned order)
+static struct page *alloc_gigantic_page(int nid, unsigned int order)
 {
 	unsigned long nr_pages = 1 << order;
 	unsigned long ret, pfn, flags;
@@ -912,7 +912,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
-static void prep_compound_gigantic_page(struct page *page, unsigned long order);
+static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 
 static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
 {
@@ -945,9 +945,9 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
 static inline bool gigantic_page_supported(void) { return true; }
 #else
 static inline bool gigantic_page_supported(void) { return false; }
-static inline void free_gigantic_page(struct page *page, unsigned order) { }
+static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
-						unsigned long order) { }
+						unsigned int order) { }
 static inline int alloc_fresh_gigantic_page(struct hstate *h,
 					nodemask_t *nodes_allowed) { return 0; }
 #endif
@@ -1073,7 +1073,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	put_page(page); /* free it into the hugepage allocator */
 }
 
-static void prep_compound_gigantic_page(struct page *page, unsigned long order)
+static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1640,7 +1640,8 @@ found:
 	return 1;
 }
 
-static void __init prep_compound_huge_page(struct page *page, int order)
+static void __init prep_compound_huge_page(struct page *page,
+		unsigned int order)
 {
 	if (unlikely(order > (MAX_ORDER - 1)))
 		prep_compound_gigantic_page(page, order);
@@ -2351,7 +2352,7 @@ static int __init hugetlb_init(void)
 module_init(hugetlb_init);
 
 /* Should be called on processing a hugepagesz=... option */
-void __init hugetlb_add_hstate(unsigned order)
+void __init hugetlb_add_hstate(unsigned int order)
 {
 	struct hstate *h;
 	unsigned long i;
diff --git a/mm/internal.h b/mm/internal.h
index 89e21a07080a..9a9fc497593f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,7 +157,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
 extern int __isolate_free_page(struct page *page, unsigned int order);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
 					unsigned int order);
-extern void prep_compound_page(struct page *page, unsigned long order);
+extern void prep_compound_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -214,7 +214,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
  * page cannot be allocated or merged in parallel. Alternatively, it must
  * handle invalid values gracefully, and use page_order_unsafe() below.
  */
-static inline unsigned long page_order(struct page *page)
+static inline unsigned int page_order(struct page *page)
 {
 	/* PageBuddy() must be checked by the caller */
 	return page_private(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78859d47aaf4..347724850665 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -163,7 +163,7 @@ bool pm_suspended_storage(void)
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
-int pageblock_order __read_mostly;
+unsigned int pageblock_order __read_mostly;
 #endif
 
 static void __free_pages_ok(struct page *page, unsigned int order);
@@ -441,7 +441,7 @@ static void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page));
 }
 
-void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -641,7 +641,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
-	int max_order = MAX_ORDER;
+	unsigned int max_order = MAX_ORDER;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1436,7 +1436,7 @@ int move_freepages(struct zone *zone,
 			  int migratetype)
 {
 	struct page *page;
-	unsigned long order;
+	unsigned int order;
 	int pages_moved = 0;
 
 #ifndef CONFIG_HOLES_IN_ZONE
@@ -1550,7 +1550,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
 							  int start_type)
 {
-	int current_order = page_order(page);
+	unsigned int current_order = page_order(page);
 	int pages;
 
 	/* Take ownership for orders >= pageblock_order */
@@ -2657,7 +2657,7 @@ static DEFINE_RATELIMIT_STATE(nopage_rs,
 		DEFAULT_RATELIMIT_INTERVAL,
 		DEFAULT_RATELIMIT_BURST);
 
-void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
+void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
@@ -2691,7 +2691,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		va_end(args);
 	}
 
-	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
+	pr_warn("%s: page allocation failure: order:%u, mode:0x%x\n",
 		current->comm, order, gfp_mask);
 
 	dump_stack();
@@ -3450,7 +3450,8 @@ void free_kmem_pages(unsigned long addr, unsigned int order)
 	}
 }
 
-static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
+static void *make_alloc_exact(unsigned long addr, unsigned int order,
+		size_t size)
 {
 	if (addr) {
 		unsigned long alloc_end = addr + (PAGE_SIZE << order);
@@ -3502,7 +3503,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
  */
 void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
 {
-	unsigned order = get_order(size);
+	unsigned int order = get_order(size);
 	struct page *p = alloc_pages_node(nid, gfp_mask, order);
 	if (!p)
 		return NULL;
@@ -3804,7 +3805,8 @@ void show_free_areas(unsigned int filter)
 	}
 
 	for_each_populated_zone(zone) {
-		unsigned long nr[MAX_ORDER], flags, order, total = 0;
+		unsigned int order;
+		unsigned long nr[MAX_ORDER], flags, total = 0;
 		unsigned char types[MAX_ORDER];
 
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -4153,7 +4155,7 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	int order = current_zonelist_order;
+	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -6818,7 +6820,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype)
 {
 	unsigned long outer_start, outer_end;
-	int ret = 0, order;
+	unsigned int order;
+	int ret = 0;
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCHv3 5/5] mm: use 'unsigned int' for page order
@ 2015-08-19  9:21   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-19  9:21 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Kirill A. Shutemov

Let's try to be consistent about data type of page order.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |  5 +++--
 mm/hugetlb.c       | 19 ++++++++++---------
 mm/internal.h      |  4 ++--
 mm/page_alloc.c    | 27 +++++++++++++++------------
 4 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4c4b7d07473..a75bbb3f7142 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -557,7 +557,7 @@ static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 	return compound_page_dtors[page[1].compound_dtor];
 }
 
-static inline int compound_order(struct page *page)
+static inline unsigned int compound_order(struct page *page)
 {
 	if (!PageHead(page))
 		return 0;
@@ -1718,7 +1718,8 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
 extern __printf(3, 4)
-void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
+void warn_alloc_failed(gfp_t gfp_mask, unsigned int order,
+		const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 53c0709fd87b..bf64bfebc473 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -817,7 +817,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 
 #if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
 static void destroy_compound_gigantic_page(struct page *page,
-					unsigned long order)
+					unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -832,7 +832,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned order)
+static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	free_contig_range(page_to_pfn(page), 1 << order);
 }
@@ -876,7 +876,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
 	return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(int nid, unsigned order)
+static struct page *alloc_gigantic_page(int nid, unsigned int order)
 {
 	unsigned long nr_pages = 1 << order;
 	unsigned long ret, pfn, flags;
@@ -912,7 +912,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
-static void prep_compound_gigantic_page(struct page *page, unsigned long order);
+static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 
 static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
 {
@@ -945,9 +945,9 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
 static inline bool gigantic_page_supported(void) { return true; }
 #else
 static inline bool gigantic_page_supported(void) { return false; }
-static inline void free_gigantic_page(struct page *page, unsigned order) { }
+static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
-						unsigned long order) { }
+						unsigned int order) { }
 static inline int alloc_fresh_gigantic_page(struct hstate *h,
 					nodemask_t *nodes_allowed) { return 0; }
 #endif
@@ -1073,7 +1073,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	put_page(page); /* free it into the hugepage allocator */
 }
 
-static void prep_compound_gigantic_page(struct page *page, unsigned long order)
+static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1640,7 +1640,8 @@ found:
 	return 1;
 }
 
-static void __init prep_compound_huge_page(struct page *page, int order)
+static void __init prep_compound_huge_page(struct page *page,
+		unsigned int order)
 {
 	if (unlikely(order > (MAX_ORDER - 1)))
 		prep_compound_gigantic_page(page, order);
@@ -2351,7 +2352,7 @@ static int __init hugetlb_init(void)
 module_init(hugetlb_init);
 
 /* Should be called on processing a hugepagesz=... option */
-void __init hugetlb_add_hstate(unsigned order)
+void __init hugetlb_add_hstate(unsigned int order)
 {
 	struct hstate *h;
 	unsigned long i;
diff --git a/mm/internal.h b/mm/internal.h
index 89e21a07080a..9a9fc497593f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,7 +157,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
 extern int __isolate_free_page(struct page *page, unsigned int order);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
 					unsigned int order);
-extern void prep_compound_page(struct page *page, unsigned long order);
+extern void prep_compound_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -214,7 +214,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
  * page cannot be allocated or merged in parallel. Alternatively, it must
  * handle invalid values gracefully, and use page_order_unsafe() below.
  */
-static inline unsigned long page_order(struct page *page)
+static inline unsigned int page_order(struct page *page)
 {
 	/* PageBuddy() must be checked by the caller */
 	return page_private(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78859d47aaf4..347724850665 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -163,7 +163,7 @@ bool pm_suspended_storage(void)
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
-int pageblock_order __read_mostly;
+unsigned int pageblock_order __read_mostly;
 #endif
 
 static void __free_pages_ok(struct page *page, unsigned int order);
@@ -441,7 +441,7 @@ static void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page));
 }
 
-void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -641,7 +641,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
-	int max_order = MAX_ORDER;
+	unsigned int max_order = MAX_ORDER;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1436,7 +1436,7 @@ int move_freepages(struct zone *zone,
 			  int migratetype)
 {
 	struct page *page;
-	unsigned long order;
+	unsigned int order;
 	int pages_moved = 0;
 
 #ifndef CONFIG_HOLES_IN_ZONE
@@ -1550,7 +1550,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
 							  int start_type)
 {
-	int current_order = page_order(page);
+	unsigned int current_order = page_order(page);
 	int pages;
 
 	/* Take ownership for orders >= pageblock_order */
@@ -2657,7 +2657,7 @@ static DEFINE_RATELIMIT_STATE(nopage_rs,
 		DEFAULT_RATELIMIT_INTERVAL,
 		DEFAULT_RATELIMIT_BURST);
 
-void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
+void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
@@ -2691,7 +2691,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		va_end(args);
 	}
 
-	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
+	pr_warn("%s: page allocation failure: order:%u, mode:0x%x\n",
 		current->comm, order, gfp_mask);
 
 	dump_stack();
@@ -3450,7 +3450,8 @@ void free_kmem_pages(unsigned long addr, unsigned int order)
 	}
 }
 
-static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
+static void *make_alloc_exact(unsigned long addr, unsigned int order,
+		size_t size)
 {
 	if (addr) {
 		unsigned long alloc_end = addr + (PAGE_SIZE << order);
@@ -3502,7 +3503,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
  */
 void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
 {
-	unsigned order = get_order(size);
+	unsigned int order = get_order(size);
 	struct page *p = alloc_pages_node(nid, gfp_mask, order);
 	if (!p)
 		return NULL;
@@ -3804,7 +3805,8 @@ void show_free_areas(unsigned int filter)
 	}
 
 	for_each_populated_zone(zone) {
-		unsigned long nr[MAX_ORDER], flags, order, total = 0;
+		unsigned int order;
+		unsigned long nr[MAX_ORDER], flags, total = 0;
 		unsigned char types[MAX_ORDER];
 
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -4153,7 +4155,7 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	int order = current_zonelist_order;
+	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -6818,7 +6820,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype)
 {
 	unsigned long outer_start, outer_end;
-	int ret = 0, order;
+	unsigned int order;
+	int ret = 0;
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 5/5] mm: use 'unsigned int' for page order
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-20  8:32     ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-20  8:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, David Rientjes, linux-kernel,
	linux-mm

On Wed 19-08-15 12:21:46, Kirill A. Shutemov wrote:
> Let's try to be consistent about data type of page order.

Looks good to me.

We still have *_control::order but that is not directly related to this
patch series.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/mm.h |  5 +++--
>  mm/hugetlb.c       | 19 ++++++++++---------
>  mm/internal.h      |  4 ++--
>  mm/page_alloc.c    | 27 +++++++++++++++------------
>  4 files changed, 30 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a4c4b7d07473..a75bbb3f7142 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -557,7 +557,7 @@ static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
>  	return compound_page_dtors[page[1].compound_dtor];
>  }
>  
> -static inline int compound_order(struct page *page)
> +static inline unsigned int compound_order(struct page *page)
>  {
>  	if (!PageHead(page))
>  		return 0;
> @@ -1718,7 +1718,8 @@ extern void si_meminfo(struct sysinfo * val);
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
>  
>  extern __printf(3, 4)
> -void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
> +void warn_alloc_failed(gfp_t gfp_mask, unsigned int order,
> +		const char *fmt, ...);
>  
>  extern void setup_per_cpu_pageset(void);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 53c0709fd87b..bf64bfebc473 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -817,7 +817,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  
>  #if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
>  static void destroy_compound_gigantic_page(struct page *page,
> -					unsigned long order)
> +					unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -832,7 +832,7 @@ static void destroy_compound_gigantic_page(struct page *page,
>  	__ClearPageHead(page);
>  }
>  
> -static void free_gigantic_page(struct page *page, unsigned order)
> +static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>  	free_contig_range(page_to_pfn(page), 1 << order);
>  }
> @@ -876,7 +876,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
>  	return zone_spans_pfn(zone, last_pfn);
>  }
>  
> -static struct page *alloc_gigantic_page(int nid, unsigned order)
> +static struct page *alloc_gigantic_page(int nid, unsigned int order)
>  {
>  	unsigned long nr_pages = 1 << order;
>  	unsigned long ret, pfn, flags;
> @@ -912,7 +912,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>  }
>  
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
> -static void prep_compound_gigantic_page(struct page *page, unsigned long order);
> +static void prep_compound_gigantic_page(struct page *page, unsigned int order);
>  
>  static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
>  {
> @@ -945,9 +945,9 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
>  static inline bool gigantic_page_supported(void) { return true; }
>  #else
>  static inline bool gigantic_page_supported(void) { return false; }
> -static inline void free_gigantic_page(struct page *page, unsigned order) { }
> +static inline void free_gigantic_page(struct page *page, unsigned int order) { }
>  static inline void destroy_compound_gigantic_page(struct page *page,
> -						unsigned long order) { }
> +						unsigned int order) { }
>  static inline int alloc_fresh_gigantic_page(struct hstate *h,
>  					nodemask_t *nodes_allowed) { return 0; }
>  #endif
> @@ -1073,7 +1073,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	put_page(page); /* free it into the hugepage allocator */
>  }
>  
> -static void prep_compound_gigantic_page(struct page *page, unsigned long order)
> +static void prep_compound_gigantic_page(struct page *page, unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -1640,7 +1640,8 @@ found:
>  	return 1;
>  }
>  
> -static void __init prep_compound_huge_page(struct page *page, int order)
> +static void __init prep_compound_huge_page(struct page *page,
> +		unsigned int order)
>  {
>  	if (unlikely(order > (MAX_ORDER - 1)))
>  		prep_compound_gigantic_page(page, order);
> @@ -2351,7 +2352,7 @@ static int __init hugetlb_init(void)
>  module_init(hugetlb_init);
>  
>  /* Should be called on processing a hugepagesz=... option */
> -void __init hugetlb_add_hstate(unsigned order)
> +void __init hugetlb_add_hstate(unsigned int order)
>  {
>  	struct hstate *h;
>  	unsigned long i;
> diff --git a/mm/internal.h b/mm/internal.h
> index 89e21a07080a..9a9fc497593f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,7 +157,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
>  extern int __isolate_free_page(struct page *page, unsigned int order);
>  extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
>  					unsigned int order);
> -extern void prep_compound_page(struct page *page, unsigned long order);
> +extern void prep_compound_page(struct page *page, unsigned int order);
>  #ifdef CONFIG_MEMORY_FAILURE
>  extern bool is_free_buddy_page(struct page *page);
>  #endif
> @@ -214,7 +214,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>   * page cannot be allocated or merged in parallel. Alternatively, it must
>   * handle invalid values gracefully, and use page_order_unsafe() below.
>   */
> -static inline unsigned long page_order(struct page *page)
> +static inline unsigned int page_order(struct page *page)
>  {
>  	/* PageBuddy() must be checked by the caller */
>  	return page_private(page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78859d47aaf4..347724850665 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -163,7 +163,7 @@ bool pm_suspended_storage(void)
>  #endif /* CONFIG_PM_SLEEP */
>  
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> -int pageblock_order __read_mostly;
> +unsigned int pageblock_order __read_mostly;
>  #endif
>  
>  static void __free_pages_ok(struct page *page, unsigned int order);
> @@ -441,7 +441,7 @@ static void free_compound_page(struct page *page)
>  	__free_pages_ok(page, compound_order(page));
>  }
>  
> -void prep_compound_page(struct page *page, unsigned long order)
> +void prep_compound_page(struct page *page, unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -641,7 +641,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_idx;
>  	unsigned long uninitialized_var(buddy_idx);
>  	struct page *buddy;
> -	int max_order = MAX_ORDER;
> +	unsigned int max_order = MAX_ORDER;
>  
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1436,7 +1436,7 @@ int move_freepages(struct zone *zone,
>  			  int migratetype)
>  {
>  	struct page *page;
> -	unsigned long order;
> +	unsigned int order;
>  	int pages_moved = 0;
>  
>  #ifndef CONFIG_HOLES_IN_ZONE
> @@ -1550,7 +1550,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
>  static void steal_suitable_fallback(struct zone *zone, struct page *page,
>  							  int start_type)
>  {
> -	int current_order = page_order(page);
> +	unsigned int current_order = page_order(page);
>  	int pages;
>  
>  	/* Take ownership for orders >= pageblock_order */
> @@ -2657,7 +2657,7 @@ static DEFINE_RATELIMIT_STATE(nopage_rs,
>  		DEFAULT_RATELIMIT_INTERVAL,
>  		DEFAULT_RATELIMIT_BURST);
>  
> -void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> +void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
>  {
>  	unsigned int filter = SHOW_MEM_FILTER_NODES;
>  
> @@ -2691,7 +2691,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  		va_end(args);
>  	}
>  
> -	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
> +	pr_warn("%s: page allocation failure: order:%u, mode:0x%x\n",
>  		current->comm, order, gfp_mask);
>  
>  	dump_stack();
> @@ -3450,7 +3450,8 @@ void free_kmem_pages(unsigned long addr, unsigned int order)
>  	}
>  }
>  
> -static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
> +static void *make_alloc_exact(unsigned long addr, unsigned int order,
> +		size_t size)
>  {
>  	if (addr) {
>  		unsigned long alloc_end = addr + (PAGE_SIZE << order);
> @@ -3502,7 +3503,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
>   */
>  void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
>  {
> -	unsigned order = get_order(size);
> +	unsigned int order = get_order(size);
>  	struct page *p = alloc_pages_node(nid, gfp_mask, order);
>  	if (!p)
>  		return NULL;
> @@ -3804,7 +3805,8 @@ void show_free_areas(unsigned int filter)
>  	}
>  
>  	for_each_populated_zone(zone) {
> -		unsigned long nr[MAX_ORDER], flags, order, total = 0;
> +		unsigned int order;
> +		unsigned long nr[MAX_ORDER], flags, total = 0;
>  		unsigned char types[MAX_ORDER];
>  
>  		if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4153,7 +4155,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  	nodemask_t used_mask;
>  	int local_node, prev_node;
>  	struct zonelist *zonelist;
> -	int order = current_zonelist_order;
> +	unsigned int order = current_zonelist_order;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_ZONELISTS; i++) {
> @@ -6818,7 +6820,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		       unsigned migratetype)
>  {
>  	unsigned long outer_start, outer_end;
> -	int ret = 0, order;
> +	unsigned int order;
> +	int ret = 0;
>  
>  	struct compact_control cc = {
>  		.nr_migratepages = 0,
> -- 
> 2.5.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 5/5] mm: use 'unsigned int' for page order
@ 2015-08-20  8:32     ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-20  8:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, David Rientjes, linux-kernel,
	linux-mm

On Wed 19-08-15 12:21:46, Kirill A. Shutemov wrote:
> Let's try to be consistent about data type of page order.

Looks good to me.

We still have *_control::order but that is not directly related to this
patch series.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/mm.h |  5 +++--
>  mm/hugetlb.c       | 19 ++++++++++---------
>  mm/internal.h      |  4 ++--
>  mm/page_alloc.c    | 27 +++++++++++++++------------
>  4 files changed, 30 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a4c4b7d07473..a75bbb3f7142 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -557,7 +557,7 @@ static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
>  	return compound_page_dtors[page[1].compound_dtor];
>  }
>  
> -static inline int compound_order(struct page *page)
> +static inline unsigned int compound_order(struct page *page)
>  {
>  	if (!PageHead(page))
>  		return 0;
> @@ -1718,7 +1718,8 @@ extern void si_meminfo(struct sysinfo * val);
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
>  
>  extern __printf(3, 4)
> -void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
> +void warn_alloc_failed(gfp_t gfp_mask, unsigned int order,
> +		const char *fmt, ...);
>  
>  extern void setup_per_cpu_pageset(void);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 53c0709fd87b..bf64bfebc473 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -817,7 +817,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  
>  #if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
>  static void destroy_compound_gigantic_page(struct page *page,
> -					unsigned long order)
> +					unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -832,7 +832,7 @@ static void destroy_compound_gigantic_page(struct page *page,
>  	__ClearPageHead(page);
>  }
>  
> -static void free_gigantic_page(struct page *page, unsigned order)
> +static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>  	free_contig_range(page_to_pfn(page), 1 << order);
>  }
> @@ -876,7 +876,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
>  	return zone_spans_pfn(zone, last_pfn);
>  }
>  
> -static struct page *alloc_gigantic_page(int nid, unsigned order)
> +static struct page *alloc_gigantic_page(int nid, unsigned int order)
>  {
>  	unsigned long nr_pages = 1 << order;
>  	unsigned long ret, pfn, flags;
> @@ -912,7 +912,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
>  }
>  
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
> -static void prep_compound_gigantic_page(struct page *page, unsigned long order);
> +static void prep_compound_gigantic_page(struct page *page, unsigned int order);
>  
>  static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
>  {
> @@ -945,9 +945,9 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
>  static inline bool gigantic_page_supported(void) { return true; }
>  #else
>  static inline bool gigantic_page_supported(void) { return false; }
> -static inline void free_gigantic_page(struct page *page, unsigned order) { }
> +static inline void free_gigantic_page(struct page *page, unsigned int order) { }
>  static inline void destroy_compound_gigantic_page(struct page *page,
> -						unsigned long order) { }
> +						unsigned int order) { }
>  static inline int alloc_fresh_gigantic_page(struct hstate *h,
>  					nodemask_t *nodes_allowed) { return 0; }
>  #endif
> @@ -1073,7 +1073,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	put_page(page); /* free it into the hugepage allocator */
>  }
>  
> -static void prep_compound_gigantic_page(struct page *page, unsigned long order)
> +static void prep_compound_gigantic_page(struct page *page, unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -1640,7 +1640,8 @@ found:
>  	return 1;
>  }
>  
> -static void __init prep_compound_huge_page(struct page *page, int order)
> +static void __init prep_compound_huge_page(struct page *page,
> +		unsigned int order)
>  {
>  	if (unlikely(order > (MAX_ORDER - 1)))
>  		prep_compound_gigantic_page(page, order);
> @@ -2351,7 +2352,7 @@ static int __init hugetlb_init(void)
>  module_init(hugetlb_init);
>  
>  /* Should be called on processing a hugepagesz=... option */
> -void __init hugetlb_add_hstate(unsigned order)
> +void __init hugetlb_add_hstate(unsigned int order)
>  {
>  	struct hstate *h;
>  	unsigned long i;
> diff --git a/mm/internal.h b/mm/internal.h
> index 89e21a07080a..9a9fc497593f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,7 +157,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
>  extern int __isolate_free_page(struct page *page, unsigned int order);
>  extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
>  					unsigned int order);
> -extern void prep_compound_page(struct page *page, unsigned long order);
> +extern void prep_compound_page(struct page *page, unsigned int order);
>  #ifdef CONFIG_MEMORY_FAILURE
>  extern bool is_free_buddy_page(struct page *page);
>  #endif
> @@ -214,7 +214,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>   * page cannot be allocated or merged in parallel. Alternatively, it must
>   * handle invalid values gracefully, and use page_order_unsafe() below.
>   */
> -static inline unsigned long page_order(struct page *page)
> +static inline unsigned int page_order(struct page *page)
>  {
>  	/* PageBuddy() must be checked by the caller */
>  	return page_private(page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78859d47aaf4..347724850665 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -163,7 +163,7 @@ bool pm_suspended_storage(void)
>  #endif /* CONFIG_PM_SLEEP */
>  
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> -int pageblock_order __read_mostly;
> +unsigned int pageblock_order __read_mostly;
>  #endif
>  
>  static void __free_pages_ok(struct page *page, unsigned int order);
> @@ -441,7 +441,7 @@ static void free_compound_page(struct page *page)
>  	__free_pages_ok(page, compound_order(page));
>  }
>  
> -void prep_compound_page(struct page *page, unsigned long order)
> +void prep_compound_page(struct page *page, unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -641,7 +641,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_idx;
>  	unsigned long uninitialized_var(buddy_idx);
>  	struct page *buddy;
> -	int max_order = MAX_ORDER;
> +	unsigned int max_order = MAX_ORDER;
>  
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1436,7 +1436,7 @@ int move_freepages(struct zone *zone,
>  			  int migratetype)
>  {
>  	struct page *page;
> -	unsigned long order;
> +	unsigned int order;
>  	int pages_moved = 0;
>  
>  #ifndef CONFIG_HOLES_IN_ZONE
> @@ -1550,7 +1550,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
>  static void steal_suitable_fallback(struct zone *zone, struct page *page,
>  							  int start_type)
>  {
> -	int current_order = page_order(page);
> +	unsigned int current_order = page_order(page);
>  	int pages;
>  
>  	/* Take ownership for orders >= pageblock_order */
> @@ -2657,7 +2657,7 @@ static DEFINE_RATELIMIT_STATE(nopage_rs,
>  		DEFAULT_RATELIMIT_INTERVAL,
>  		DEFAULT_RATELIMIT_BURST);
>  
> -void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> +void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
>  {
>  	unsigned int filter = SHOW_MEM_FILTER_NODES;
>  
> @@ -2691,7 +2691,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  		va_end(args);
>  	}
>  
> -	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
> +	pr_warn("%s: page allocation failure: order:%u, mode:0x%x\n",
>  		current->comm, order, gfp_mask);
>  
>  	dump_stack();
> @@ -3450,7 +3450,8 @@ void free_kmem_pages(unsigned long addr, unsigned int order)
>  	}
>  }
>  
> -static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
> +static void *make_alloc_exact(unsigned long addr, unsigned int order,
> +		size_t size)
>  {
>  	if (addr) {
>  		unsigned long alloc_end = addr + (PAGE_SIZE << order);
> @@ -3502,7 +3503,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
>   */
>  void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
>  {
> -	unsigned order = get_order(size);
> +	unsigned int order = get_order(size);
>  	struct page *p = alloc_pages_node(nid, gfp_mask, order);
>  	if (!p)
>  		return NULL;
> @@ -3804,7 +3805,8 @@ void show_free_areas(unsigned int filter)
>  	}
>  
>  	for_each_populated_zone(zone) {
> -		unsigned long nr[MAX_ORDER], flags, order, total = 0;
> +		unsigned int order;
> +		unsigned long nr[MAX_ORDER], flags, total = 0;
>  		unsigned char types[MAX_ORDER];
>  
>  		if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4153,7 +4155,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  	nodemask_t used_mask;
>  	int local_node, prev_node;
>  	struct zonelist *zonelist;
> -	int order = current_zonelist_order;
> +	unsigned int order = current_zonelist_order;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_ZONELISTS; i++) {
> @@ -6818,7 +6820,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		       unsigned migratetype)
>  {
>  	unsigned long outer_start, outer_end;
> -	int ret = 0, order;
> +	unsigned int order;
> +	int ret = 0;
>  
>  	struct compact_control cc = {
>  		.nr_migratepages = 0,
> -- 
> 2.5.0

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
  2015-08-19  9:21 ` Kirill A. Shutemov
@ 2015-08-20 12:31   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-20 12:31 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2252 bytes --]

On Wed, Aug 19, 2015 at 12:21:41PM +0300, Kirill A. Shutemov wrote:
> Here's my attempt on fixing recently discovered race in compound_head().
> It should make compound_head() reliable in all contexts.
> 
> The patchset is against Linus' tree. Let me know if it need to be rebased
> onto different baseline.
> 
> It's expected to have conflicts with my page-flags patchset and probably
> should be applied before it.
> 
> v3:
>    - Fix build without hugetlb;
>    - Drop page->first_page;
>    - Update comment for free_compound_page();
>    - Use 'unsigned int' for page order;
> 
> v2: Per Hugh's suggestion page->compound_head is moved into third double
>     word. This way we can avoid memory overhead which v1 had in some
>     cases.
> 
>     This place in struct page is rather overloaded. More testing is
>     required to make sure we don't collide with anyone.

Andrew, can we have the patchset applied, if nobody has objections?

It applies cleanly into your patchstack just before my page-flags
patchset.

As expected, it causes few conflicts with patches:

 page-flags-introduce-page-flags-policies-wrt-compound-pages.patch
 mm-sanitize-page-mapping-for-tail-pages.patch
 include-linux-page-flagsh-rename-macros-to-avoid-collisions.patch

Updated patches with solved conflicts are attached.

Let me know if I need to do anything else about this.

Hugh, does it address your worry wrt page-flags?

Before you've mentioned races of whether the head page still agrees with
the tail. I don't think it's an issue: you can get this kind of race only
in very special environments like pfn scanner where you anyway need to
re-validate the page after stabilizing it.

Bloat from my page-flags is also reduced substantially. Size of your
page_is_locked() example in allnoconfig case reduced from 32 to 17 bytes.
With the patchset it look this way:

00003070 <page_is_locked>:
    3070:	8b 50 14             	mov    0x14(%eax),%edx
    3073:	f6 c2 01             	test   $0x1,%dl
    3076:	8d 4a ff             	lea    -0x1(%edx),%ecx
    3079:	0f 45 c1             	cmovne %ecx,%eax
    307c:	8b 00                	mov    (%eax),%eax
    307e:	24 01                	and    $0x1,%al
    3080:	c3                   	ret    

-- 
 Kirill A. Shutemov

[-- Attachment #2: include-linux-page-flagsh-rename-macros-to-avoid-collisions.patch --]
[-- Type: text/plain, Size: 7813 bytes --]

>From 1b88b5b6025e81a9f6b99275d66e129de52bd795 Mon Sep 17 00:00:00 2001
From: Andrew Morton <akpm@linux-foundation.org>
Date: Tue, 18 Aug 2015 09:49:52 +1000
Subject: [PATCH] include/linux/page-flags.h: rename macros to avoid collisions

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/page-flags.h | 106 ++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 53 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 3d9270a9e885..cfff9fd5d858 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -130,15 +130,15 @@ enum pageflags {
 #ifndef __GENERATING_BOUNDS_H
 
 /* Page flags policies wrt compound pages */
-#define ANY(page, enforce)	page
-#define HEAD(page, enforce)	compound_head(page)
-#define NO_TAIL(page, enforce) ({					\
+#define PF_ANY(page, enforce)	page
+#define PF_HEAD(page, enforce)	compound_head(page)
+#define PF_NO_TAIL(page, enforce) ({					\
 		if (enforce)						\
 			VM_BUG_ON_PAGE(PageTail(page), page);		\
 		else							\
 			page = compound_head(page);			\
 		page;})
-#define NO_COMPOUND(page, enforce) ({					\
+#define PF_NO_COMPOUND(page, enforce) ({					\
 		if (enforce)						\
 			VM_BUG_ON_PAGE(PageCompound(page), page);	\
 		page;})
@@ -225,55 +225,55 @@ static inline int PageCompound(struct page *page);
 static inline int PageTail(struct page *page);
 static struct page *compound_head(struct page *page);
 
-__PAGEFLAG(Locked, locked, NO_TAIL)
-PAGEFLAG(Error, error, NO_COMPOUND) TESTCLEARFLAG(Error, error, NO_COMPOUND)
-PAGEFLAG(Referenced, referenced, HEAD)
-	TESTCLEARFLAG(Referenced, referenced, HEAD)
-	__SETPAGEFLAG(Referenced, referenced, HEAD)
-PAGEFLAG(Dirty, dirty, HEAD) TESTSCFLAG(Dirty, dirty, HEAD)
-	__CLEARPAGEFLAG(Dirty, dirty, HEAD)
-PAGEFLAG(LRU, lru, HEAD) __CLEARPAGEFLAG(LRU, lru, HEAD)
-PAGEFLAG(Active, active, HEAD) __CLEARPAGEFLAG(Active, active, HEAD)
-	TESTCLEARFLAG(Active, active, HEAD)
-__PAGEFLAG(Slab, slab, NO_TAIL)
-__PAGEFLAG(SlobFree, slob_free, NO_TAIL)
-PAGEFLAG(Checked, checked, NO_COMPOUND) /* Used by some filesystems */
+__PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
+PAGEFLAG(Referenced, referenced, PF_HEAD)
+	TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
+	__SETPAGEFLAG(Referenced, referenced, PF_HEAD)
+PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
+	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
+PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
+	TESTCLEARFLAG(Active, active, PF_HEAD)
+__PAGEFLAG(Slab, slab, PF_NO_TAIL)
+__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
+PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
 
 /* Xen */
-PAGEFLAG(Pinned, pinned, NO_COMPOUND) TESTSCFLAG(Pinned, pinned, NO_COMPOUND)
-PAGEFLAG(SavePinned, savepinned, NO_COMPOUND)
-PAGEFLAG(Foreign, foreign, NO_COMPOUND)
+PAGEFLAG(Pinned, pinned, PF_NO_COMPOUND) TESTSCFLAG(Pinned, pinned, PF_NO_COMPOUND)
+PAGEFLAG(SavePinned, savepinned, PF_NO_COMPOUND)
+PAGEFLAG(Foreign, foreign, PF_NO_COMPOUND)
 
-PAGEFLAG(Reserved, reserved, NO_COMPOUND)
-	__CLEARPAGEFLAG(Reserved, reserved, NO_COMPOUND)
-PAGEFLAG(SwapBacked, swapbacked, NO_TAIL)
-	__CLEARPAGEFLAG(SwapBacked, swapbacked, NO_TAIL)
-	__SETPAGEFLAG(SwapBacked, swapbacked, NO_TAIL)
+PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
+	__CLEARPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
+PAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
+	__CLEARPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
+	__SETPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
 
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
  * - PG_private and PG_private_2 cause releasepage() and co to be invoked
  */
-PAGEFLAG(Private, private, ANY) __SETPAGEFLAG(Private, private, ANY)
-	__CLEARPAGEFLAG(Private, private, ANY)
-PAGEFLAG(Private2, private_2, ANY) TESTSCFLAG(Private2, private_2, ANY)
-PAGEFLAG(OwnerPriv1, owner_priv_1, ANY)
-	TESTCLEARFLAG(OwnerPriv1, owner_priv_1, ANY)
+PAGEFLAG(Private, private, PF_ANY) __SETPAGEFLAG(Private, private, PF_ANY)
+	__CLEARPAGEFLAG(Private, private, PF_ANY)
+PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY)
+PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
+	TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
 
 /*
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
  */
-TESTPAGEFLAG(Writeback, writeback, NO_COMPOUND)
-	TESTSCFLAG(Writeback, writeback, NO_COMPOUND)
-PAGEFLAG(MappedToDisk, mappedtodisk, NO_COMPOUND)
+TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
+	TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
+PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_COMPOUND)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim, NO_COMPOUND)
-	TESTCLEARFLAG(Reclaim, reclaim, NO_COMPOUND)
-PAGEFLAG(Readahead, reclaim, NO_COMPOUND)
-	TESTCLEARFLAG(Readahead, reclaim, NO_COMPOUND)
+PAGEFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
 
 #ifdef CONFIG_HIGHMEM
 /*
@@ -286,33 +286,33 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, NO_COMPOUND)
+PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
 
-PAGEFLAG(Unevictable, unevictable, HEAD)
-	__CLEARPAGEFLAG(Unevictable, unevictable, HEAD)
-	TESTCLEARFLAG(Unevictable, unevictable, HEAD)
+PAGEFLAG(Unevictable, unevictable, PF_HEAD)
+	__CLEARPAGEFLAG(Unevictable, unevictable, PF_HEAD)
+	TESTCLEARFLAG(Unevictable, unevictable, PF_HEAD)
 
 #ifdef CONFIG_MMU
-PAGEFLAG(Mlocked, mlocked, NO_TAIL) __CLEARPAGEFLAG(Mlocked, mlocked, NO_TAIL)
-	TESTSCFLAG(Mlocked, mlocked, NO_TAIL)
-	__TESTCLEARFLAG(Mlocked, mlocked, NO_TAIL)
+PAGEFLAG(Mlocked, mlocked, PF_NO_TAIL) __CLEARPAGEFLAG(Mlocked, mlocked, PF_NO_TAIL)
+	TESTSCFLAG(Mlocked, mlocked, PF_NO_TAIL)
+	__TESTCLEARFLAG(Mlocked, mlocked, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(Mlocked) __CLEARPAGEFLAG_NOOP(Mlocked)
 	TESTSCFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked)
 #endif
 
 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
-PAGEFLAG(Uncached, uncached, NO_COMPOUND)
+PAGEFLAG(Uncached, uncached, PF_NO_COMPOUND)
 #else
 PAGEFLAG_FALSE(Uncached)
 #endif
 
 #ifdef CONFIG_MEMORY_FAILURE
-PAGEFLAG(HWPoison, hwpoison, ANY)
-TESTSCFLAG(HWPoison, hwpoison, ANY)
+PAGEFLAG(HWPoison, hwpoison, PF_ANY)
+TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 #else
 PAGEFLAG_FALSE(HWPoison)
@@ -402,7 +402,7 @@ static inline void SetPageUptodate(struct page *page)
 	set_bit(PG_uptodate, &page->flags);
 }
 
-CLEARPAGEFLAG(Uptodate, uptodate, NO_TAIL)
+CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
 
 int test_clear_page_writeback(struct page *page);
 int __test_set_page_writeback(struct page *page, bool keep_write);
@@ -422,7 +422,7 @@ static inline void set_page_writeback_keepwrite(struct page *page)
 	test_set_page_writeback_keepwrite(page);
 }
 
-__PAGEFLAG(Head, head, ANY) CLEARPAGEFLAG(Head, head, ANY)
+__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)
 
 static inline int PageTail(struct page *page)
 {
@@ -643,10 +643,10 @@ static inline int page_has_private(struct page *page)
 	return !!(page->flags & PAGE_FLAGS_PRIVATE);
 }
 
-#undef ANY
-#undef HEAD
-#undef NO_TAIL
-#undef NO_COMPOUND
+#undef PF_ANY
+#undef PF_HEAD
+#undef PF_NO_TAIL
+#undef PF_NO_COMPOUND
 #endif /* !__GENERATING_BOUNDS_H */
 
 #endif	/* PAGE_FLAGS_H */
-- 
2.5.0


[-- Attachment #3: page-flags-introduce-page-flags-policies-wrt-compound-pages.patch --]
[-- Type: text/plain, Size: 11630 bytes --]

>From 54d99b201f355af4e4bd401a1b39a8570dcda948 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 18 Aug 2015 09:49:48 +1000
Subject: [PATCH] page-flags: introduce page flags policies wrt compound pages

This patch adds a third argument to macros which create function
definitions for page flags.  This argument defines how page-flags helpers
behave on compound functions.

For now we define four policies:

- PF_ANY: the helper function operates on the page it gets, regardless
  if it's non-compound, head or tail.

- PF_HEAD: the helper function operates on the head page of the compound
  page if it gets tail page.

- PF_NO_TAIL: only head and non-compond pages are acceptable for this
  helper function.

- PF_NO_COMPOUND: only non-compound pages are acceptable for this helper
  function.

For now we use policy PF_ANY for all helpers, which matches current
behaviour.

We do not enforce the policy for TESTPAGEFLAG, because we have flags
checked for random pages all over the kernel.  Noticeable exception to
this is PageTransHuge() which triggers VM_BUG_ON() for tail page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/page-flags.h | 153 +++++++++++++++++++++++++++------------------
 1 file changed, 92 insertions(+), 61 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 490fbd3f8552..85b60119523a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -129,49 +129,68 @@ enum pageflags {
 
 #ifndef __GENERATING_BOUNDS_H
 
+/* Page flags policies wrt compound pages */
+#define ANY(page, enforce)	page
+#define HEAD(page, enforce)	compound_head(page)
+#define NO_TAIL(page, enforce) ({					\
+		if (enforce)						\
+			VM_BUG_ON_PAGE(PageTail(page), page);		\
+		else							\
+			page = compound_head(page);			\
+		page;})
+#define NO_COMPOUND(page, enforce) ({					\
+		if (enforce)						\
+			VM_BUG_ON_PAGE(PageCompound(page), page);	\
+		page;})
+
 /*
  * Macros to create function definitions for page flags
  */
-#define TESTPAGEFLAG(uname, lname)					\
-static inline int Page##uname(const struct page *page)			\
-			{ return test_bit(PG_##lname, &page->flags); }
+#define TESTPAGEFLAG(uname, lname, policy)				\
+static inline int Page##uname(struct page *page)			\
+	{ return test_bit(PG_##lname, &policy(page, 0)->flags); }
 
-#define SETPAGEFLAG(uname, lname)					\
+#define SETPAGEFLAG(uname, lname, policy)				\
 static inline void SetPage##uname(struct page *page)			\
-			{ set_bit(PG_##lname, &page->flags); }
+	{ set_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define CLEARPAGEFLAG(uname, lname)					\
+#define CLEARPAGEFLAG(uname, lname, policy)				\
 static inline void ClearPage##uname(struct page *page)			\
-			{ clear_bit(PG_##lname, &page->flags); }
+	{ clear_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define __SETPAGEFLAG(uname, lname)					\
+#define __SETPAGEFLAG(uname, lname, policy)				\
 static inline void __SetPage##uname(struct page *page)			\
-			{ __set_bit(PG_##lname, &page->flags); }
+	{ __set_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define __CLEARPAGEFLAG(uname, lname)					\
+#define __CLEARPAGEFLAG(uname, lname, policy)				\
 static inline void __ClearPage##uname(struct page *page)		\
-			{ __clear_bit(PG_##lname, &page->flags); }
+	{ __clear_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define TESTSETFLAG(uname, lname)					\
+#define TESTSETFLAG(uname, lname, policy)				\
 static inline int TestSetPage##uname(struct page *page)			\
-		{ return test_and_set_bit(PG_##lname, &page->flags); }
+	{ return test_and_set_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define TESTCLEARFLAG(uname, lname)					\
+#define TESTCLEARFLAG(uname, lname, policy)				\
 static inline int TestClearPage##uname(struct page *page)		\
-		{ return test_and_clear_bit(PG_##lname, &page->flags); }
+	{ return test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define __TESTCLEARFLAG(uname, lname)					\
+#define __TESTCLEARFLAG(uname, lname, policy)				\
 static inline int __TestClearPage##uname(struct page *page)		\
-		{ return __test_and_clear_bit(PG_##lname, &page->flags); }
+	{ return __test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); }
 
-#define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname)		\
-	SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname)
+#define PAGEFLAG(uname, lname, policy)					\
+	TESTPAGEFLAG(uname, lname, policy)				\
+	SETPAGEFLAG(uname, lname, policy)				\
+	CLEARPAGEFLAG(uname, lname, policy)
 
-#define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname)		\
-	__SETPAGEFLAG(uname, lname)  __CLEARPAGEFLAG(uname, lname)
+#define __PAGEFLAG(uname, lname, policy)				\
+	TESTPAGEFLAG(uname, lname, policy)				\
+	__SETPAGEFLAG(uname, lname, policy)				\
+	__CLEARPAGEFLAG(uname, lname, policy)
 
-#define TESTSCFLAG(uname, lname)					\
-	TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname)
+#define TESTSCFLAG(uname, lname, policy)				\
+	TESTSETFLAG(uname, lname, policy)				\
+	TESTCLEARFLAG(uname, lname, policy)
 
 #define TESTPAGEFLAG_FALSE(uname)					\
 static inline int Page##uname(const struct page *page) { return 0; }
@@ -200,47 +219,54 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
 #define TESTSCFLAG_FALSE(uname)						\
 	TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
-struct page;	/* forward declaration */
-
-TESTPAGEFLAG(Locked, locked)
-PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
-PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
-	__SETPAGEFLAG(Referenced, referenced)
-PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
-PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
-PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
-	TESTCLEARFLAG(Active, active)
-__PAGEFLAG(Slab, slab)
-PAGEFLAG(Checked, checked)		/* Used by some filesystems */
-PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
-PAGEFLAG(SavePinned, savepinned);			/* Xen */
-PAGEFLAG(Foreign, foreign);				/* Xen */
-PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
-PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
-	__SETPAGEFLAG(SwapBacked, swapbacked)
-
-__PAGEFLAG(SlobFree, slob_free)
+/* Forward declarations */
+struct page;
+static inline int PageCompound(struct page *page);
+static inline int PageTail(struct page *page);
+static struct page *compound_head(struct page *page);
+
+TESTPAGEFLAG(Locked, locked, ANY)
+PAGEFLAG(Error, error, ANY) TESTCLEARFLAG(Error, error, ANY)
+PAGEFLAG(Referenced, referenced, ANY) TESTCLEARFLAG(Referenced, referenced, ANY)
+	__SETPAGEFLAG(Referenced, referenced, ANY)
+PAGEFLAG(Dirty, dirty, ANY) TESTSCFLAG(Dirty, dirty, ANY)
+	__CLEARPAGEFLAG(Dirty, dirty, ANY)
+PAGEFLAG(LRU, lru, ANY) __CLEARPAGEFLAG(LRU, lru, ANY)
+PAGEFLAG(Active, active, ANY) __CLEARPAGEFLAG(Active, active, ANY)
+	TESTCLEARFLAG(Active, active, ANY)
+__PAGEFLAG(Slab, slab, ANY)
+PAGEFLAG(Checked, checked, ANY)		/* Used by some filesystems */
+PAGEFLAG(Pinned, pinned, ANY) TESTSCFLAG(Pinned, pinned, ANY)	/* Xen */
+PAGEFLAG(SavePinned, savepinned, ANY);			/* Xen */
+PAGEFLAG(Foreign, foreign, ANY);				/* Xen */
+PAGEFLAG(Reserved, reserved, ANY) __CLEARPAGEFLAG(Reserved, reserved, ANY)
+PAGEFLAG(SwapBacked, swapbacked, ANY)
+	__CLEARPAGEFLAG(SwapBacked, swapbacked, ANY)
+	__SETPAGEFLAG(SwapBacked, swapbacked, ANY)
+
+__PAGEFLAG(SlobFree, slob_free, ANY)
 
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
  * - PG_private and PG_private_2 cause releasepage() and co to be invoked
  */
-PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private)
-	__CLEARPAGEFLAG(Private, private)
-PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2)
-PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
+PAGEFLAG(Private, private, ANY) __SETPAGEFLAG(Private, private, ANY)
+	__CLEARPAGEFLAG(Private, private, ANY)
+PAGEFLAG(Private2, private_2, ANY) TESTSCFLAG(Private2, private_2, ANY)
+PAGEFLAG(OwnerPriv1, owner_priv_1, ANY)
+	TESTCLEARFLAG(OwnerPriv1, owner_priv_1, ANY)
 
 /*
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
  */
-TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
-PAGEFLAG(MappedToDisk, mappedtodisk)
+TESTPAGEFLAG(Writeback, writeback, ANY) TESTSCFLAG(Writeback, writeback, ANY)
+PAGEFLAG(MappedToDisk, mappedtodisk, ANY)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
+PAGEFLAG(Reclaim, reclaim, ANY) TESTCLEARFLAG(Reclaim, reclaim, ANY)
+PAGEFLAG(Readahead, reclaim, ANY) TESTCLEARFLAG(Readahead, reclaim, ANY)
 
 #ifdef CONFIG_HIGHMEM
 /*
@@ -253,31 +279,32 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache)
+PAGEFLAG(SwapCache, swapcache, ANY)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
 
-PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
-	TESTCLEARFLAG(Unevictable, unevictable)
+PAGEFLAG(Unevictable, unevictable, ANY)
+	__CLEARPAGEFLAG(Unevictable, unevictable, ANY)
+	TESTCLEARFLAG(Unevictable, unevictable, ANY)
 
 #ifdef CONFIG_MMU
-PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
-	TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked)
+PAGEFLAG(Mlocked, mlocked, ANY) __CLEARPAGEFLAG(Mlocked, mlocked, ANY)
+	TESTSCFLAG(Mlocked, mlocked, ANY) __TESTCLEARFLAG(Mlocked, mlocked, ANY)
 #else
 PAGEFLAG_FALSE(Mlocked) __CLEARPAGEFLAG_NOOP(Mlocked)
 	TESTSCFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked)
 #endif
 
 #ifdef CONFIG_ARCH_USES_PG_UNCACHED
-PAGEFLAG(Uncached, uncached)
+PAGEFLAG(Uncached, uncached, ANY)
 #else
 PAGEFLAG_FALSE(Uncached)
 #endif
 
 #ifdef CONFIG_MEMORY_FAILURE
-PAGEFLAG(HWPoison, hwpoison)
-TESTSCFLAG(HWPoison, hwpoison)
+PAGEFLAG(HWPoison, hwpoison, ANY)
+TESTSCFLAG(HWPoison, hwpoison, ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 #else
 PAGEFLAG_FALSE(HWPoison)
@@ -362,7 +389,7 @@ static inline void SetPageUptodate(struct page *page)
 	set_bit(PG_uptodate, &(page)->flags);
 }
 
-CLEARPAGEFLAG(Uptodate, uptodate)
+CLEARPAGEFLAG(Uptodate, uptodate, ANY)
 
 int test_clear_page_writeback(struct page *page);
 int __test_set_page_writeback(struct page *page, bool keep_write);
@@ -382,7 +409,7 @@ static inline void set_page_writeback_keepwrite(struct page *page)
 	test_set_page_writeback_keepwrite(page);
 }
 
-__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
+__PAGEFLAG(Head, head, ANY) CLEARPAGEFLAG(Head, head, ANY)
 
 static inline int PageTail(struct page *page)
 {
@@ -603,6 +630,10 @@ static inline int page_has_private(struct page *page)
 	return !!(page->flags & PAGE_FLAGS_PRIVATE);
 }
 
+#undef ANY
+#undef HEAD
+#undef NO_TAIL
+#undef NO_COMPOUND
 #endif /* !__GENERATING_BOUNDS_H */
 
 #endif	/* PAGE_FLAGS_H */
-- 
2.5.0


[-- Attachment #4: mm-sanitize-page-mapping-for-tail-pages.patch --]
[-- Type: text/plain, Size: 4740 bytes --]

>From cdb3d7e7f717f3e96ee97c4b0a29a745701bd813 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 18 Aug 2015 09:49:51 +1000
Subject: [PATCH] mm: sanitize page->mapping for tail pages

We don't define meaning of page->mapping for tail pages.  Currently it's
always NULL, which can be inconsistent with head page and potentially lead
to problems.

Let's poison the pointer to catch all illigal uses.

page_rmapping(), page_mapping() and page_anon_vma() are changed to look on
head page.

The only illegal use I've caught so far is __GPF_COMP pages from sound
subsystem, mapped with PTEs.  do_shared_fault() is changed to use
page_rmapping() instead of direct access to fault_page->mapping.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/poison.h |  4 ++++
 mm/huge_memory.c       |  2 +-
 mm/memory.c            |  2 +-
 mm/page_alloc.c        |  6 ++++++
 mm/util.c              | 10 ++++++----
 5 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81c5e2a..7b2a7fcde6a3 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -32,6 +32,10 @@
 /********** mm/debug-pagealloc.c **********/
 #define PAGE_POISON 0xaa
 
+/********** mm/page_alloc.c ************/
+
+#define TAIL_MAPPING	((void *) 0x01014A11 + POISON_POINTER_DELTA)
+
 /********** mm/slab.c **********/
 /*
  * Magic nums for obj red zoning.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ef15f8f8bf3..1a3accef0756 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1772,7 +1772,7 @@ static void __split_huge_page_refcount(struct page *page,
 		*/
 		page_tail->_mapcount = page->_mapcount;
 
-		BUG_ON(page_tail->mapping);
+		BUG_ON(page_tail->mapping != TAIL_MAPPING);
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
diff --git a/mm/memory.c b/mm/memory.c
index 6cd0b2160401..558ee16167d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3087,7 +3087,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * pinned by vma->vm_file's reference.  We rely on unlock_page()'s
 	 * release semantics to prevent the compiler from undoing this copying.
 	 */
-	mapping = fault_page->mapping;
+	mapping = page_rmapping(fault_page);
 	unlock_page(fault_page);
 	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d752298a9e48..adefa3ad8e3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -470,6 +470,7 @@ void prep_compound_page(struct page *page, unsigned int order)
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
 		set_page_count(p, 0);
+		p->mapping = TAIL_MAPPING;
 		set_compound_head(p, page);
 	}
 }
@@ -855,6 +856,10 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		ret = 0;
 		goto out;
 	}
+	if (page->mapping != TAIL_MAPPING) {
+		bad_page(page, "corrupted mapping in tail page", 0);
+		goto out;
+	}
 	if (unlikely(!PageTail(page))) {
 		bad_page(page, "PageTail not set", 0);
 		goto out;
@@ -865,6 +870,7 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 	}
 	ret = 0;
 out:
+	page->mapping = NULL;
 	clear_compound_head(page);
 	return ret;
 }
diff --git a/mm/util.c b/mm/util.c
index 68ff8a5361e7..0c7f65e7ef5e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -355,7 +355,9 @@ struct anon_vma *page_anon_vma(struct page *page)
 
 struct address_space *page_mapping(struct page *page)
 {
-	unsigned long mapping;
+	struct address_space *mapping;
+
+	page = compound_head(page);
 
 	/* This happens if someone calls flush_dcache_page on slab page */
 	if (unlikely(PageSlab(page)))
@@ -368,10 +370,10 @@ struct address_space *page_mapping(struct page *page)
 		return swap_address_space(entry);
 	}
 
-	mapping = (unsigned long)page->mapping;
-	if (mapping & PAGE_MAPPING_FLAGS)
+	mapping = page->mapping;
+	if ((unsigned long)mapping & PAGE_MAPPING_FLAGS)
 		return NULL;
-	return page->mapping;
+	return mapping;
 }
 
 int overcommit_ratio_handler(struct ctl_table *table, int write,
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
@ 2015-08-20 12:31   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-20 12:31 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2252 bytes --]

On Wed, Aug 19, 2015 at 12:21:41PM +0300, Kirill A. Shutemov wrote:
> Here's my attempt on fixing recently discovered race in compound_head().
> It should make compound_head() reliable in all contexts.
> 
> The patchset is against Linus' tree. Let me know if it need to be rebased
> onto different baseline.
> 
> It's expected to have conflicts with my page-flags patchset and probably
> should be applied before it.
> 
> v3:
>    - Fix build without hugetlb;
>    - Drop page->first_page;
>    - Update comment for free_compound_page();
>    - Use 'unsigned int' for page order;
> 
> v2: Per Hugh's suggestion page->compound_head is moved into third double
>     word. This way we can avoid memory overhead which v1 had in some
>     cases.
> 
>     This place in struct page is rather overloaded. More testing is
>     required to make sure we don't collide with anyone.

Andrew, can we have the patchset applied, if nobody has objections?

It applies cleanly into your patchstack just before my page-flags
patchset.

As expected, it causes few conflicts with patches:

 page-flags-introduce-page-flags-policies-wrt-compound-pages.patch
 mm-sanitize-page-mapping-for-tail-pages.patch
 include-linux-page-flagsh-rename-macros-to-avoid-collisions.patch

Updated patches with solved conflicts are attached.

Let me know if I need to do anything else about this.

Hugh, does it address your worry wrt page-flags?

Before you've mentioned races of whether the head page still agrees with
the tail. I don't think it's an issue: you can get this kind of race only
in very special environments like pfn scanner where you anyway need to
re-validate the page after stabilizing it.

Bloat from my page-flags is also reduced substantially. Size of your
page_is_locked() example in allnoconfig case reduced from 32 to 17 bytes.
With the patchset it look this way:

00003070 <page_is_locked>:
    3070:	8b 50 14             	mov    0x14(%eax),%edx
    3073:	f6 c2 01             	test   $0x1,%dl
    3076:	8d 4a ff             	lea    -0x1(%edx),%ecx
    3079:	0f 45 c1             	cmovne %ecx,%eax
    307c:	8b 00                	mov    (%eax),%eax
    307e:	24 01                	and    $0x1,%al
    3080:	c3                   	ret    

-- 
 Kirill A. Shutemov

[-- Attachment #2: include-linux-page-flagsh-rename-macros-to-avoid-collisions.patch --]
[-- Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-20 23:26     ` Andrew Morton
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-20 23:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm

On Wed, 19 Aug 2015 12:21:44 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> The patch halves space occupied by compound_dtor and compound_order in
> struct page.
> 
> For compound_order, it's trivial long -> int/short conversion.
> 
> For get_compound_page_dtor(), we now use hardcoded table for destructor
> lookup and store its index in the struct page instead of direct pointer
> to destructor. It shouldn't be a big trouble to maintain the table: we
> have only two destructor and NULL currently.
> 
> This patch free up one word in tail pages for reuse. This is preparation
> for the next patch.
> 
> ...
>
> @@ -145,8 +143,13 @@ struct page {
>  						 */
>  		/* First tail page of compound page */
>  		struct {
> -			compound_page_dtor *compound_dtor;
> -			unsigned long compound_order;
> +#ifdef CONFIG_64BIT
> +			unsigned int compound_dtor;
> +			unsigned int compound_order;
> +#else
> +			unsigned short int compound_dtor;
> +			unsigned short int compound_order;
> +#endif

Why not use ushort for 64-bit as well?

It would be clearer if that new enum had a name, so we use "enum
compound_dtor_id" everywhere instead of a bare uint.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
@ 2015-08-20 23:26     ` Andrew Morton
  0 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-20 23:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm

On Wed, 19 Aug 2015 12:21:44 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> The patch halves space occupied by compound_dtor and compound_order in
> struct page.
> 
> For compound_order, it's trivial long -> int/short conversion.
> 
> For get_compound_page_dtor(), we now use hardcoded table for destructor
> lookup and store its index in the struct page instead of direct pointer
> to destructor. It shouldn't be a big trouble to maintain the table: we
> have only two destructor and NULL currently.
> 
> This patch free up one word in tail pages for reuse. This is preparation
> for the next patch.
> 
> ...
>
> @@ -145,8 +143,13 @@ struct page {
>  						 */
>  		/* First tail page of compound page */
>  		struct {
> -			compound_page_dtor *compound_dtor;
> -			unsigned long compound_order;
> +#ifdef CONFIG_64BIT
> +			unsigned int compound_dtor;
> +			unsigned int compound_order;
> +#else
> +			unsigned short int compound_dtor;
> +			unsigned short int compound_order;
> +#endif

Why not use ushort for 64-bit as well?

It would be clearer if that new enum had a name, so we use "enum
compound_dtor_id" everywhere instead of a bare uint.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-20 23:36     ` Andrew Morton
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-20 23:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm

On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
> 	CPU0					CPU1
> 
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
> 					put_page()
> 					  tail->first_page = NULL
>       head = tail->first_page
> 					alloc_pages(__GFP_COMP)
> 					   prep_compound_page()
> 					     tail->first_page = head
> 					     __SetPageTail(p);
>       !!PageTail() == true
>     <head == NULL dereferencing>
> 
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
> 
> We can fix the race by changing how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
> 
> The patch introduces page->compound_head into third double word block in
> front of compound_dtor and compound_order. That means it shares storage
> space with:
> 
>  - page->lru.next;
>  - page->next;
>  - page->rcu_head.next;
>  - page->pmd_huge_pte;
> 
> That's too long list to be absolutely sure, but looks like nobody uses
> bit 0 of the word. It can be used to encode PageTail(). And if the bit
> set, rest of the word is pointer to head page.

So nothing else which participates in the union in the "Third double
word block" is allowed to use bit zero of the first word.

Is this really true?  For example if it's a slab page, will that page
ever be inspected by code which is looking for the PageTail bit?


Anyway, this is quite subtle and there's a risk that people will
accidentally break it later on.  I don't think the patch puts
sufficient documentation in place to prevent this.  And even
documentation might not be enough to prevent accidents.

>
> ...
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -120,7 +120,12 @@ struct page {
>  		};
>  	};
>  
> -	/* Third double word block */
> +	/*
> +	 * Third double word block
> +	 *
> +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> +	 * for non-tail pages.
> +	 */
>  	union {
>  		struct list_head lru;	/* Pageout list, eg. active_list
>  					 * protected by zone->lru_lock !
> @@ -143,6 +148,7 @@ struct page {
>  						 */
>  		/* First tail page of compound page */
>  		struct {
> +			unsigned long compound_head; /* If bit zero is set */

I think the comments around here should have more details and should
be louder!


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-20 23:36     ` Andrew Morton
  0 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-20 23:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm

On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
> 	CPU0					CPU1
> 
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
> 					put_page()
> 					  tail->first_page = NULL
>       head = tail->first_page
> 					alloc_pages(__GFP_COMP)
> 					   prep_compound_page()
> 					     tail->first_page = head
> 					     __SetPageTail(p);
>       !!PageTail() == true
>     <head == NULL dereferencing>
> 
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
> 
> We can fix the race by changing how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
> 
> The patch introduces page->compound_head into third double word block in
> front of compound_dtor and compound_order. That means it shares storage
> space with:
> 
>  - page->lru.next;
>  - page->next;
>  - page->rcu_head.next;
>  - page->pmd_huge_pte;
> 
> That's too long list to be absolutely sure, but looks like nobody uses
> bit 0 of the word. It can be used to encode PageTail(). And if the bit
> set, rest of the word is pointer to head page.

So nothing else which participates in the union in the "Third double
word block" is allowed to use bit zero of the first word.

Is this really true?  For example if it's a slab page, will that page
ever be inspected by code which is looking for the PageTail bit?


Anyway, this is quite subtle and there's a risk that people will
accidentally break it later on.  I don't think the patch puts
sufficient documentation in place to prevent this.  And even
documentation might not be enough to prevent accidents.

>
> ...
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -120,7 +120,12 @@ struct page {
>  		};
>  	};
>  
> -	/* Third double word block */
> +	/*
> +	 * Third double word block
> +	 *
> +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> +	 * for non-tail pages.
> +	 */
>  	union {
>  		struct list_head lru;	/* Pageout list, eg. active_list
>  					 * protected by zone->lru_lock !
> @@ -143,6 +148,7 @@ struct page {
>  						 */
>  		/* First tail page of compound page */
>  		struct {
> +			unsigned long compound_head; /* If bit zero is set */

I think the comments around here should have more details and should
be louder!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
  2015-08-20 12:31   ` Kirill A. Shutemov
@ 2015-08-20 23:38     ` Andrew Morton
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-20 23:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm

On Thu, 20 Aug 2015 15:31:07 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Wed, Aug 19, 2015 at 12:21:41PM +0300, Kirill A. Shutemov wrote:
> > Here's my attempt on fixing recently discovered race in compound_head().
> > It should make compound_head() reliable in all contexts.
> > 
> > The patchset is against Linus' tree. Let me know if it need to be rebased
> > onto different baseline.
> > 
> > It's expected to have conflicts with my page-flags patchset and probably
> > should be applied before it.
> > 
> > v3:
> >    - Fix build without hugetlb;
> >    - Drop page->first_page;
> >    - Update comment for free_compound_page();
> >    - Use 'unsigned int' for page order;
> > 
> > v2: Per Hugh's suggestion page->compound_head is moved into third double
> >     word. This way we can avoid memory overhead which v1 had in some
> >     cases.
> > 
> >     This place in struct page is rather overloaded. More testing is
> >     required to make sure we don't collide with anyone.
> 
> Andrew, can we have the patchset applied, if nobody has objections?

I've been hoping to hear from Hugh and I wasn't planning on processing
these before the 4.2 release.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
@ 2015-08-20 23:38     ` Andrew Morton
  0 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-20 23:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm

On Thu, 20 Aug 2015 15:31:07 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Wed, Aug 19, 2015 at 12:21:41PM +0300, Kirill A. Shutemov wrote:
> > Here's my attempt on fixing recently discovered race in compound_head().
> > It should make compound_head() reliable in all contexts.
> > 
> > The patchset is against Linus' tree. Let me know if it need to be rebased
> > onto different baseline.
> > 
> > It's expected to have conflicts with my page-flags patchset and probably
> > should be applied before it.
> > 
> > v3:
> >    - Fix build without hugetlb;
> >    - Drop page->first_page;
> >    - Update comment for free_compound_page();
> >    - Use 'unsigned int' for page order;
> > 
> > v2: Per Hugh's suggestion page->compound_head is moved into third double
> >     word. This way we can avoid memory overhead which v1 had in some
> >     cases.
> > 
> >     This place in struct page is rather overloaded. More testing is
> >     required to make sure we don't collide with anyone.
> 
> Andrew, can we have the patchset applied, if nobody has objections?

I've been hoping to hear from Hugh and I wasn't planning on processing
these before the 4.2 release.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
  2015-08-20 23:26     ` Andrew Morton
@ 2015-08-21  7:13       ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-21  7:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, David Rientjes, linux-kernel,
	linux-mm

On Thu 20-08-15 16:26:04, Andrew Morton wrote:
> On Wed, 19 Aug 2015 12:21:44 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > The patch halves space occupied by compound_dtor and compound_order in
> > struct page.
> > 
> > For compound_order, it's trivial long -> int/short conversion.
> > 
> > For get_compound_page_dtor(), we now use hardcoded table for destructor
> > lookup and store its index in the struct page instead of direct pointer
> > to destructor. It shouldn't be a big trouble to maintain the table: we
> > have only two destructor and NULL currently.
> > 
> > This patch free up one word in tail pages for reuse. This is preparation
> > for the next patch.
> > 
> > ...
> >
> > @@ -145,8 +143,13 @@ struct page {
> >  						 */
> >  		/* First tail page of compound page */
> >  		struct {
> > -			compound_page_dtor *compound_dtor;
> > -			unsigned long compound_order;
> > +#ifdef CONFIG_64BIT
> > +			unsigned int compound_dtor;
> > +			unsigned int compound_order;
> > +#else
> > +			unsigned short int compound_dtor;
> > +			unsigned short int compound_order;
> > +#endif
> 
> Why not use ushort for 64-bit as well?

Yeah, I have asked the same in the previous round. So I've tried to
compile with ushort. The resulting code was slightly larger
   text    data     bss     dec     hex filename
 476370   90811   44632  611813   955e5 mm/built-in.o.prev
 476418   90811   44632  611861   95615 mm/built-in.o.after

E.g. prep_compound_page
before:
4c6b:       c7 47 68 01 00 00 00    movl   $0x1,0x68(%rdi)
4c72:       89 77 6c                mov    %esi,0x6c(%rdi)
after:
4c6c:       66 c7 47 68 01 00       movw   $0x1,0x68(%rdi)
4c72:       66 89 77 6a             mov    %si,0x6a(%rdi)

which looks very similar to me but I am not an expert here so it might
possible that movw is slower.

__free_pages_ok
before:
63af:       8b 77 6c                mov    0x6c(%rdi),%esi
after:
63b1:       0f b7 77 6a             movzwl 0x6a(%rdi),%esi

which looks like a worse code to me. Whether this all is measurable or
worth it I dunno. The ifdef is ugly but maybe the ugliness is a destiny
for struct page.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
@ 2015-08-21  7:13       ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-21  7:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, David Rientjes, linux-kernel,
	linux-mm

On Thu 20-08-15 16:26:04, Andrew Morton wrote:
> On Wed, 19 Aug 2015 12:21:44 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > The patch halves space occupied by compound_dtor and compound_order in
> > struct page.
> > 
> > For compound_order, it's trivial long -> int/short conversion.
> > 
> > For get_compound_page_dtor(), we now use hardcoded table for destructor
> > lookup and store its index in the struct page instead of direct pointer
> > to destructor. It shouldn't be a big trouble to maintain the table: we
> > have only two destructor and NULL currently.
> > 
> > This patch free up one word in tail pages for reuse. This is preparation
> > for the next patch.
> > 
> > ...
> >
> > @@ -145,8 +143,13 @@ struct page {
> >  						 */
> >  		/* First tail page of compound page */
> >  		struct {
> > -			compound_page_dtor *compound_dtor;
> > -			unsigned long compound_order;
> > +#ifdef CONFIG_64BIT
> > +			unsigned int compound_dtor;
> > +			unsigned int compound_order;
> > +#else
> > +			unsigned short int compound_dtor;
> > +			unsigned short int compound_order;
> > +#endif
> 
> Why not use ushort for 64-bit as well?

Yeah, I have asked the same in the previous round. So I've tried to
compile with ushort. The resulting code was slightly larger
   text    data     bss     dec     hex filename
 476370   90811   44632  611813   955e5 mm/built-in.o.prev
 476418   90811   44632  611861   95615 mm/built-in.o.after

E.g. prep_compound_page
before:
4c6b:       c7 47 68 01 00 00 00    movl   $0x1,0x68(%rdi)
4c72:       89 77 6c                mov    %esi,0x6c(%rdi)
after:
4c6c:       66 c7 47 68 01 00       movw   $0x1,0x68(%rdi)
4c72:       66 89 77 6a             mov    %si,0x6a(%rdi)

which looks very similar to me but I am not an expert here so it might
possible that movw is slower.

__free_pages_ok
before:
63af:       8b 77 6c                mov    0x6c(%rdi),%esi
after:
63b1:       0f b7 77 6a             movzwl 0x6a(%rdi),%esi

which looks like a worse code to me. Whether this all is measurable or
worth it I dunno. The ifdef is ugly but maybe the ugliness is a destiny
for struct page.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
  2015-08-21  7:13       ` Michal Hocko
@ 2015-08-21 10:40         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-21 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	David Rientjes, linux-kernel, linux-mm

On Fri, Aug 21, 2015 at 09:13:42AM +0200, Michal Hocko wrote:
> On Thu 20-08-15 16:26:04, Andrew Morton wrote:
> > On Wed, 19 Aug 2015 12:21:44 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> > > The patch halves space occupied by compound_dtor and compound_order in
> > > struct page.
> > > 
> > > For compound_order, it's trivial long -> int/short conversion.
> > > 
> > > For get_compound_page_dtor(), we now use hardcoded table for destructor
> > > lookup and store its index in the struct page instead of direct pointer
> > > to destructor. It shouldn't be a big trouble to maintain the table: we
> > > have only two destructor and NULL currently.
> > > 
> > > This patch free up one word in tail pages for reuse. This is preparation
> > > for the next patch.
> > > 
> > > ...
> > >
> > > @@ -145,8 +143,13 @@ struct page {
> > >  						 */
> > >  		/* First tail page of compound page */
> > >  		struct {
> > > -			compound_page_dtor *compound_dtor;
> > > -			unsigned long compound_order;
> > > +#ifdef CONFIG_64BIT
> > > +			unsigned int compound_dtor;
> > > +			unsigned int compound_order;
> > > +#else
> > > +			unsigned short int compound_dtor;
> > > +			unsigned short int compound_order;
> > > +#endif
> > 
> > Why not use ushort for 64-bit as well?
> 
> Yeah, I have asked the same in the previous round. So I've tried to
> compile with ushort. The resulting code was slightly larger
>    text    data     bss     dec     hex filename
>  476370   90811   44632  611813   955e5 mm/built-in.o.prev
>  476418   90811   44632  611861   95615 mm/built-in.o.after
> 
> E.g. prep_compound_page
> before:
> 4c6b:       c7 47 68 01 00 00 00    movl   $0x1,0x68(%rdi)
> 4c72:       89 77 6c                mov    %esi,0x6c(%rdi)
> after:
> 4c6c:       66 c7 47 68 01 00       movw   $0x1,0x68(%rdi)
> 4c72:       66 89 77 6a             mov    %si,0x6a(%rdi)
> 
> which looks very similar to me but I am not an expert here so it might
> possible that movw is slower.
> 
> __free_pages_ok
> before:
> 63af:       8b 77 6c                mov    0x6c(%rdi),%esi
> after:
> 63b1:       0f b7 77 6a             movzwl 0x6a(%rdi),%esi
> 
> which looks like a worse code to me. Whether this all is measurable or
> worth it I dunno. The ifdef is ugly but maybe the ugliness is a destiny
> for struct page.

I don't care about the ifdef that much. If you guys prefer to drop it I'm
fine with that.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
@ 2015-08-21 10:40         ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-21 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	David Rientjes, linux-kernel, linux-mm

On Fri, Aug 21, 2015 at 09:13:42AM +0200, Michal Hocko wrote:
> On Thu 20-08-15 16:26:04, Andrew Morton wrote:
> > On Wed, 19 Aug 2015 12:21:44 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> > > The patch halves space occupied by compound_dtor and compound_order in
> > > struct page.
> > > 
> > > For compound_order, it's trivial long -> int/short conversion.
> > > 
> > > For get_compound_page_dtor(), we now use hardcoded table for destructor
> > > lookup and store its index in the struct page instead of direct pointer
> > > to destructor. It shouldn't be a big trouble to maintain the table: we
> > > have only two destructor and NULL currently.
> > > 
> > > This patch free up one word in tail pages for reuse. This is preparation
> > > for the next patch.
> > > 
> > > ...
> > >
> > > @@ -145,8 +143,13 @@ struct page {
> > >  						 */
> > >  		/* First tail page of compound page */
> > >  		struct {
> > > -			compound_page_dtor *compound_dtor;
> > > -			unsigned long compound_order;
> > > +#ifdef CONFIG_64BIT
> > > +			unsigned int compound_dtor;
> > > +			unsigned int compound_order;
> > > +#else
> > > +			unsigned short int compound_dtor;
> > > +			unsigned short int compound_order;
> > > +#endif
> > 
> > Why not use ushort for 64-bit as well?
> 
> Yeah, I have asked the same in the previous round. So I've tried to
> compile with ushort. The resulting code was slightly larger
>    text    data     bss     dec     hex filename
>  476370   90811   44632  611813   955e5 mm/built-in.o.prev
>  476418   90811   44632  611861   95615 mm/built-in.o.after
> 
> E.g. prep_compound_page
> before:
> 4c6b:       c7 47 68 01 00 00 00    movl   $0x1,0x68(%rdi)
> 4c72:       89 77 6c                mov    %esi,0x6c(%rdi)
> after:
> 4c6c:       66 c7 47 68 01 00       movw   $0x1,0x68(%rdi)
> 4c72:       66 89 77 6a             mov    %si,0x6a(%rdi)
> 
> which looks very similar to me but I am not an expert here so it might
> possible that movw is slower.
> 
> __free_pages_ok
> before:
> 63af:       8b 77 6c                mov    0x6c(%rdi),%esi
> after:
> 63b1:       0f b7 77 6a             movzwl 0x6a(%rdi),%esi
> 
> which looks like a worse code to me. Whether this all is measurable or
> worth it I dunno. The ifdef is ugly but maybe the ugliness is a destiny
> for struct page.

I don't care about the ifdef that much. If you guys prefer to drop it I'm
fine with that.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
  2015-08-21 10:40         ` Kirill A. Shutemov
@ 2015-08-21 10:51           ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-21 10:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	David Rientjes, linux-kernel, linux-mm

On Fri 21-08-15 13:40:59, Kirill A. Shutemov wrote:
> On Fri, Aug 21, 2015 at 09:13:42AM +0200, Michal Hocko wrote:
> > On Thu 20-08-15 16:26:04, Andrew Morton wrote:
[...]
> > > Why not use ushort for 64-bit as well?
> > 
> > Yeah, I have asked the same in the previous round. So I've tried to
> > compile with ushort. The resulting code was slightly larger
> >    text    data     bss     dec     hex filename
> >  476370   90811   44632  611813   955e5 mm/built-in.o.prev
> >  476418   90811   44632  611861   95615 mm/built-in.o.after
> > 
> > E.g. prep_compound_page
> > before:
> > 4c6b:       c7 47 68 01 00 00 00    movl   $0x1,0x68(%rdi)
> > 4c72:       89 77 6c                mov    %esi,0x6c(%rdi)
> > after:
> > 4c6c:       66 c7 47 68 01 00       movw   $0x1,0x68(%rdi)
> > 4c72:       66 89 77 6a             mov    %si,0x6a(%rdi)
> > 
> > which looks very similar to me but I am not an expert here so it might
> > possible that movw is slower.
> > 
> > __free_pages_ok
> > before:
> > 63af:       8b 77 6c                mov    0x6c(%rdi),%esi
> > after:
> > 63b1:       0f b7 77 6a             movzwl 0x6a(%rdi),%esi
> > 
> > which looks like a worse code to me. Whether this all is measurable or
> > worth it I dunno. The ifdef is ugly but maybe the ugliness is a destiny
> > for struct page.
> 
> I don't care about the ifdef that much. If you guys prefer to drop it I'm
> fine with that.

I can live with it. It makes the struct more complicated which is what
struck me. If there is a good reason and a better generated code is a
good one then I do not object but please make it a separate patch so
that we do not wonder why this has been done in the future.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page
@ 2015-08-21 10:51           ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-21 10:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	David Rientjes, linux-kernel, linux-mm

On Fri 21-08-15 13:40:59, Kirill A. Shutemov wrote:
> On Fri, Aug 21, 2015 at 09:13:42AM +0200, Michal Hocko wrote:
> > On Thu 20-08-15 16:26:04, Andrew Morton wrote:
[...]
> > > Why not use ushort for 64-bit as well?
> > 
> > Yeah, I have asked the same in the previous round. So I've tried to
> > compile with ushort. The resulting code was slightly larger
> >    text    data     bss     dec     hex filename
> >  476370   90811   44632  611813   955e5 mm/built-in.o.prev
> >  476418   90811   44632  611861   95615 mm/built-in.o.after
> > 
> > E.g. prep_compound_page
> > before:
> > 4c6b:       c7 47 68 01 00 00 00    movl   $0x1,0x68(%rdi)
> > 4c72:       89 77 6c                mov    %esi,0x6c(%rdi)
> > after:
> > 4c6c:       66 c7 47 68 01 00       movw   $0x1,0x68(%rdi)
> > 4c72:       66 89 77 6a             mov    %si,0x6a(%rdi)
> > 
> > which looks very similar to me but I am not an expert here so it might
> > possible that movw is slower.
> > 
> > __free_pages_ok
> > before:
> > 63af:       8b 77 6c                mov    0x6c(%rdi),%esi
> > after:
> > 63b1:       0f b7 77 6a             movzwl 0x6a(%rdi),%esi
> > 
> > which looks like a worse code to me. Whether this all is measurable or
> > worth it I dunno. The ifdef is ugly but maybe the ugliness is a destiny
> > for struct page.
> 
> I don't care about the ifdef that much. If you guys prefer to drop it I'm
> fine with that.

I can live with it. It makes the struct more complicated which is what
struck me. If there is a good reason and a better generated code is a
good one then I do not object but please make it a separate patch so
that we do not wonder why this has been done in the future.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-20 23:36     ` Andrew Morton
@ 2015-08-21 12:10       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-21 12:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm, Christoph Lameter

On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> > 	CPU0					CPU1
> > 
> > isolate_migratepages_block()
> >   page_count()
> >     compound_head()
> >       !!PageTail() == true
> > 					put_page()
> > 					  tail->first_page = NULL
> >       head = tail->first_page
> > 					alloc_pages(__GFP_COMP)
> > 					   prep_compound_page()
> > 					     tail->first_page = head
> > 					     __SetPageTail(p);
> >       !!PageTail() == true
> >     <head == NULL dereferencing>
> > 
> > The race is pure theoretical. I don't it's possible to trigger it in
> > practice. But who knows.
> > 
> > We can fix the race by changing how encode PageTail() and compound_head()
> > within struct page to be able to update them in one shot.
> > 
> > The patch introduces page->compound_head into third double word block in
> > front of compound_dtor and compound_order. That means it shares storage
> > space with:
> > 
> >  - page->lru.next;
> >  - page->next;
> >  - page->rcu_head.next;
> >  - page->pmd_huge_pte;
> > 
> > That's too long list to be absolutely sure, but looks like nobody uses
> > bit 0 of the word. It can be used to encode PageTail(). And if the bit
> > set, rest of the word is pointer to head page.
> 
> So nothing else which participates in the union in the "Third double
> word block" is allowed to use bit zero of the first word.

Correct.

> Is this really true?  For example if it's a slab page, will that page
> ever be inspected by code which is looking for the PageTail bit?

+Christoph.

What we know for sure is that space is not used in tail pages, otherwise
it would collide with current compound_dtor.

For head/small pages it gets trickier. I convinced myself that it should
be safe this way:

All fields it shares space with are pointers (with possible exception of
pmd_huge_pte, see below) to objects with sizeof() > 1. I think it's
reasonable to expect that the bit 0 in such pointers would be clear due
alignment. We do the same for page->mapping.

On pmd_huge_pte: it's pgtable_t which on most architectures is typedef to
struct page *. That should not create any conflicts. On some architectures
it's pte_t *, which is fine too. On arc it's virtual address of the page
in form of unsigned long. It should work.

The worry I have about pmd_huge_pte is that some new architecture may
choose to implement pgtable_t as pfn and that will collide on bit 0. :-/

We can address this worry by shifting pmd_huge_pte to the second word in
the double word block. But I'm not sure if we should.

And of course there's chance that these field are used not according to
its type. I didn't find such cases, but I can't guarantee that they don't
exist.

I tested patched kernel with all three SLAB allocator and was not able to
crash it under trinity. More testing is required.

> Anyway, this is quite subtle and there's a risk that people will
> accidentally break it later on.  I don't think the patch puts
> sufficient documentation in place to prevent this.

I would appreciate for suggestion on place and form of documentation.

> And even documentation might not be enough to prevent accidents.

The only think I can propose is VM_BUG_ON() in PageTail() and
compound_head() which would ensure that page->compound_page points to
place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.

Do you consider this helpful?

> >
> > ...
> >
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -120,7 +120,12 @@ struct page {
> >  		};
> >  	};
> >  
> > -	/* Third double word block */
> > +	/*
> > +	 * Third double word block
> > +	 *
> > +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> > +	 * for non-tail pages.
> > +	 */
> >  	union {
> >  		struct list_head lru;	/* Pageout list, eg. active_list
> >  					 * protected by zone->lru_lock !
> > @@ -143,6 +148,7 @@ struct page {
> >  						 */
> >  		/* First tail page of compound page */
> >  		struct {
> > +			unsigned long compound_head; /* If bit zero is set */
> 
> I think the comments around here should have more details and should
> be louder!

I'm always bad when it comes to documentation. Is it enough?

	/*
	 * Third double word block
	 *
	 * WARNING: bit 0 of the first word encode PageTail(). That means
	 * the rest users of the storage space MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-21 12:10       ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-21 12:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm, Christoph Lameter

On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> > 	CPU0					CPU1
> > 
> > isolate_migratepages_block()
> >   page_count()
> >     compound_head()
> >       !!PageTail() == true
> > 					put_page()
> > 					  tail->first_page = NULL
> >       head = tail->first_page
> > 					alloc_pages(__GFP_COMP)
> > 					   prep_compound_page()
> > 					     tail->first_page = head
> > 					     __SetPageTail(p);
> >       !!PageTail() == true
> >     <head == NULL dereferencing>
> > 
> > The race is pure theoretical. I don't it's possible to trigger it in
> > practice. But who knows.
> > 
> > We can fix the race by changing how encode PageTail() and compound_head()
> > within struct page to be able to update them in one shot.
> > 
> > The patch introduces page->compound_head into third double word block in
> > front of compound_dtor and compound_order. That means it shares storage
> > space with:
> > 
> >  - page->lru.next;
> >  - page->next;
> >  - page->rcu_head.next;
> >  - page->pmd_huge_pte;
> > 
> > That's too long list to be absolutely sure, but looks like nobody uses
> > bit 0 of the word. It can be used to encode PageTail(). And if the bit
> > set, rest of the word is pointer to head page.
> 
> So nothing else which participates in the union in the "Third double
> word block" is allowed to use bit zero of the first word.

Correct.

> Is this really true?  For example if it's a slab page, will that page
> ever be inspected by code which is looking for the PageTail bit?

+Christoph.

What we know for sure is that space is not used in tail pages, otherwise
it would collide with current compound_dtor.

For head/small pages it gets trickier. I convinced myself that it should
be safe this way:

All fields it shares space with are pointers (with possible exception of
pmd_huge_pte, see below) to objects with sizeof() > 1. I think it's
reasonable to expect that the bit 0 in such pointers would be clear due
alignment. We do the same for page->mapping.

On pmd_huge_pte: it's pgtable_t which on most architectures is typedef to
struct page *. That should not create any conflicts. On some architectures
it's pte_t *, which is fine too. On arc it's virtual address of the page
in form of unsigned long. It should work.

The worry I have about pmd_huge_pte is that some new architecture may
choose to implement pgtable_t as pfn and that will collide on bit 0. :-/

We can address this worry by shifting pmd_huge_pte to the second word in
the double word block. But I'm not sure if we should.

And of course there's chance that these field are used not according to
its type. I didn't find such cases, but I can't guarantee that they don't
exist.

I tested patched kernel with all three SLAB allocator and was not able to
crash it under trinity. More testing is required.

> Anyway, this is quite subtle and there's a risk that people will
> accidentally break it later on.  I don't think the patch puts
> sufficient documentation in place to prevent this.

I would appreciate for suggestion on place and form of documentation.

> And even documentation might not be enough to prevent accidents.

The only think I can propose is VM_BUG_ON() in PageTail() and
compound_head() which would ensure that page->compound_page points to
place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.

Do you consider this helpful?

> >
> > ...
> >
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -120,7 +120,12 @@ struct page {
> >  		};
> >  	};
> >  
> > -	/* Third double word block */
> > +	/*
> > +	 * Third double word block
> > +	 *
> > +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> > +	 * for non-tail pages.
> > +	 */
> >  	union {
> >  		struct list_head lru;	/* Pageout list, eg. active_list
> >  					 * protected by zone->lru_lock !
> > @@ -143,6 +148,7 @@ struct page {
> >  						 */
> >  		/* First tail page of compound page */
> >  		struct {
> > +			unsigned long compound_head; /* If bit zero is set */
> 
> I think the comments around here should have more details and should
> be louder!

I'm always bad when it comes to documentation. Is it enough?

	/*
	 * Third double word block
	 *
	 * WARNING: bit 0 of the first word encode PageTail(). That means
	 * the rest users of the storage space MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-21 12:10       ` Kirill A. Shutemov
@ 2015-08-21 16:11         ` Christoph Lameter
  -1 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2015-08-21 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:

> > Is this really true?  For example if it's a slab page, will that page
> > ever be inspected by code which is looking for the PageTail bit?
>
> +Christoph.
>
> What we know for sure is that space is not used in tail pages, otherwise
> it would collide with current compound_dtor.

Sl*b allocators only do a virt_to_head_page on tail pages.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-21 16:11         ` Christoph Lameter
  0 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2015-08-21 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:

> > Is this really true?  For example if it's a slab page, will that page
> > ever be inspected by code which is looking for the PageTail bit?
>
> +Christoph.
>
> What we know for sure is that space is not used in tail pages, otherwise
> it would collide with current compound_dtor.

Sl*b allocators only do a virt_to_head_page on tail pages.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-21 16:11         ` Christoph Lameter
@ 2015-08-21 19:31           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-21 19:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
> On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
> 
> > > Is this really true?  For example if it's a slab page, will that page
> > > ever be inspected by code which is looking for the PageTail bit?
> >
> > +Christoph.
> >
> > What we know for sure is that space is not used in tail pages, otherwise
> > it would collide with current compound_dtor.
> 
> Sl*b allocators only do a virt_to_head_page on tail pages.

The question was whether it's safe to assume that the bit 0 is always zero
in the word as this bit will encode PageTail().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-21 19:31           ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-21 19:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
> On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
> 
> > > Is this really true?  For example if it's a slab page, will that page
> > > ever be inspected by code which is looking for the PageTail bit?
> >
> > +Christoph.
> >
> > What we know for sure is that space is not used in tail pages, otherwise
> > it would collide with current compound_dtor.
> 
> Sl*b allocators only do a virt_to_head_page on tail pages.

The question was whether it's safe to assume that the bit 0 is always zero
in the word as this bit will encode PageTail().

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-21 19:31           ` Kirill A. Shutemov
@ 2015-08-21 19:34             ` Andrew Morton
  -1 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-21 19:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Christoph Lameter, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, 21 Aug 2015 22:31:09 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
> > On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
> > 
> > > > Is this really true?  For example if it's a slab page, will that page
> > > > ever be inspected by code which is looking for the PageTail bit?
> > >
> > > +Christoph.
> > >
> > > What we know for sure is that space is not used in tail pages, otherwise
> > > it would collide with current compound_dtor.
> > 
> > Sl*b allocators only do a virt_to_head_page on tail pages.
> 
> The question was whether it's safe to assume that the bit 0 is always zero
> in the word as this bit will encode PageTail().

That wasn't my question actually...

What I'm wondering is: if this page is being used for slab, will any
code path ever run PageTail() against it?  If not, we don't need to be
concerned about that bit.

And slab was just the example I chose.  The same question petains to
all other uses of that union.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-21 19:34             ` Andrew Morton
  0 siblings, 0 replies; 96+ messages in thread
From: Andrew Morton @ 2015-08-21 19:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Christoph Lameter, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, 21 Aug 2015 22:31:09 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
> > On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
> > 
> > > > Is this really true?  For example if it's a slab page, will that page
> > > > ever be inspected by code which is looking for the PageTail bit?
> > >
> > > +Christoph.
> > >
> > > What we know for sure is that space is not used in tail pages, otherwise
> > > it would collide with current compound_dtor.
> > 
> > Sl*b allocators only do a virt_to_head_page on tail pages.
> 
> The question was whether it's safe to assume that the bit 0 is always zero
> in the word as this bit will encode PageTail().

That wasn't my question actually...

What I'm wondering is: if this page is being used for slab, will any
code path ever run PageTail() against it?  If not, we don't need to be
concerned about that bit.

And slab was just the example I chose.  The same question petains to
all other uses of that union.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-21 19:34             ` Andrew Morton
@ 2015-08-21 21:15               ` Christoph Lameter
  -1 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2015-08-21 21:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, 21 Aug 2015, Andrew Morton wrote:

> On Fri, 21 Aug 2015 22:31:09 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>
> > On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
> > > On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
> > >
> > > > > Is this really true?  For example if it's a slab page, will that page
> > > > > ever be inspected by code which is looking for the PageTail bit?
> > > >
> > > > +Christoph.
> > > >
> > > > What we know for sure is that space is not used in tail pages, otherwise
> > > > it would collide with current compound_dtor.
> > >
> > > Sl*b allocators only do a virt_to_head_page on tail pages.
> >
> > The question was whether it's safe to assume that the bit 0 is always zero
> > in the word as this bit will encode PageTail().
>
> That wasn't my question actually...
>
> What I'm wondering is: if this page is being used for slab, will any
> code path ever run PageTail() against it?  If not, we don't need to be
> concerned about that bit.

virt_to_head_page will run PageTail because it uses compound_head(). And
compound_head needs to use the first_page pointer if its a tail page.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-21 21:15               ` Christoph Lameter
  0 siblings, 0 replies; 96+ messages in thread
From: Christoph Lameter @ 2015-08-21 21:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm

On Fri, 21 Aug 2015, Andrew Morton wrote:

> On Fri, 21 Aug 2015 22:31:09 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>
> > On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
> > > On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
> > >
> > > > > Is this really true?  For example if it's a slab page, will that page
> > > > > ever be inspected by code which is looking for the PageTail bit?
> > > >
> > > > +Christoph.
> > > >
> > > > What we know for sure is that space is not used in tail pages, otherwise
> > > > it would collide with current compound_dtor.
> > >
> > > Sl*b allocators only do a virt_to_head_page on tail pages.
> >
> > The question was whether it's safe to assume that the bit 0 is always zero
> > in the word as this bit will encode PageTail().
>
> That wasn't my question actually...
>
> What I'm wondering is: if this page is being used for slab, will any
> code path ever run PageTail() against it?  If not, we don't need to be
> concerned about that bit.

virt_to_head_page will run PageTail because it uses compound_head(). And
compound_head needs to use the first_page pointer if its a tail page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
  2015-08-20 23:38     ` Andrew Morton
@ 2015-08-22 20:13       ` Hugh Dickins
  -1 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-22 20:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Mel Gorman, Hugh Dickins, Andrea Arcangeli,
	Dave Hansen, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm

On Thu, 20 Aug 2015, Andrew Morton wrote:
> On Thu, 20 Aug 2015 15:31:07 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Wed, Aug 19, 2015 at 12:21:41PM +0300, Kirill A. Shutemov wrote:
> > > Here's my attempt on fixing recently discovered race in compound_head().
> > > It should make compound_head() reliable in all contexts.
> > > 
> > > The patchset is against Linus' tree. Let me know if it need to be rebased
> > > onto different baseline.
> > > 
> > > It's expected to have conflicts with my page-flags patchset and probably
> > > should be applied before it.
> > > 
> > > v3:
> > >    - Fix build without hugetlb;
> > >    - Drop page->first_page;
> > >    - Update comment for free_compound_page();
> > >    - Use 'unsigned int' for page order;
> > > 
> > > v2: Per Hugh's suggestion page->compound_head is moved into third double
> > >     word. This way we can avoid memory overhead which v1 had in some
> > >     cases.
> > > 
> > >     This place in struct page is rather overloaded. More testing is
> > >     required to make sure we don't collide with anyone.
> > 
> > Andrew, can we have the patchset applied, if nobody has objections?
> 
> I've been hoping to hear from Hugh and I wasn't planning on processing
> these before the 4.2 release.

I think this patchset is very good, in a variety of different ways.

Fixes a tricky race, deletes more code than it adds, shrinks kernel text,
deletes tricky functions relying on barriers, frees up a page flag bit,
removes a discrepancy between configs, is really neat in how PageTail
is necessarily false on all lru and lru-candidate pages, probably more.
Good job.

Yes, I did think the compound destructor enum stuff over-engineered,
and would have preferred just direct calls to free_compound_page()
or free_huge_page() myself.  But when I tried to make a patch on
top to do that, even when I left PageHuge out-of-line (which had
certainly not been my intention), it still generated more kernel
text than Kirill's enum version (maybe his "- 1" in compound_head
works better in some places than masking out 3, I didn't study);
so let's forget about that.

I've not actually run and tested with it, but I shall be pleased
when it gets in to mmotm, and will do so then.

As to whether it answers my doubts about his patch-flags patchset
already in mmotm (not your question here, Andrew, but Kirill's in
another of these mails): I'd say that it greatly reduces my doubts,
but does not entirely set me at ease with the bloat.

This set here gives us a compound_head() that is safe to tuck
inside PageFlags ops in that set there: that doesn't worry me
any more.  And the bloat is reduced enough that I don't think
it should be allowed to block Kirill's progress.

But I can't shake off the idea that someone somewhere (0day perf
results? Mel on an __spree?) is going to need to shave away some
of these hidden and rarely needed compound_head() calls one day.

Take __activate_page() in mm/swap.c as an example, something that
begins with a bold PageLRU && !PageActive && !PageUnevictable.
That function contains six sequences of the form
mov 0x20(%rdi),%rax; test $0x1,%al; je over_next; sub $0x1,%rax.

Five of which I expect could be avoided if we just did a
compound_head() conversion on entry.  I suppose any branch
predictor will do a fine job with the last five: am I just too
old-fashioned to be thinking we should (have the ability to)
eliminate them completely?

I'm not saying that we need to convert __activate_page, or anything
else, at this time; but I do think we shall want diet versions of
at least the simple PageFlags tests themselves (we should already
be sparing with the atomic ones), and need to establish convention
now for what the diet versions of PageFlags will be called.

Would __PageFlag be good enough?  Could we say that
__SetPageFlag and __ClearPageFlag omit the compound_head() -
we already have to think carefully when applying those?

Hugh

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
@ 2015-08-22 20:13       ` Hugh Dickins
  0 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-22 20:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Mel Gorman, Hugh Dickins, Andrea Arcangeli,
	Dave Hansen, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm

On Thu, 20 Aug 2015, Andrew Morton wrote:
> On Thu, 20 Aug 2015 15:31:07 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Wed, Aug 19, 2015 at 12:21:41PM +0300, Kirill A. Shutemov wrote:
> > > Here's my attempt on fixing recently discovered race in compound_head().
> > > It should make compound_head() reliable in all contexts.
> > > 
> > > The patchset is against Linus' tree. Let me know if it need to be rebased
> > > onto different baseline.
> > > 
> > > It's expected to have conflicts with my page-flags patchset and probably
> > > should be applied before it.
> > > 
> > > v3:
> > >    - Fix build without hugetlb;
> > >    - Drop page->first_page;
> > >    - Update comment for free_compound_page();
> > >    - Use 'unsigned int' for page order;
> > > 
> > > v2: Per Hugh's suggestion page->compound_head is moved into third double
> > >     word. This way we can avoid memory overhead which v1 had in some
> > >     cases.
> > > 
> > >     This place in struct page is rather overloaded. More testing is
> > >     required to make sure we don't collide with anyone.
> > 
> > Andrew, can we have the patchset applied, if nobody has objections?
> 
> I've been hoping to hear from Hugh and I wasn't planning on processing
> these before the 4.2 release.

I think this patchset is very good, in a variety of different ways.

Fixes a tricky race, deletes more code than it adds, shrinks kernel text,
deletes tricky functions relying on barriers, frees up a page flag bit,
removes a discrepancy between configs, is really neat in how PageTail
is necessarily false on all lru and lru-candidate pages, probably more.
Good job.

Yes, I did think the compound destructor enum stuff over-engineered,
and would have preferred just direct calls to free_compound_page()
or free_huge_page() myself.  But when I tried to make a patch on
top to do that, even when I left PageHuge out-of-line (which had
certainly not been my intention), it still generated more kernel
text than Kirill's enum version (maybe his "- 1" in compound_head
works better in some places than masking out 3, I didn't study);
so let's forget about that.

I've not actually run and tested with it, but I shall be pleased
when it gets in to mmotm, and will do so then.

As to whether it answers my doubts about his patch-flags patchset
already in mmotm (not your question here, Andrew, but Kirill's in
another of these mails): I'd say that it greatly reduces my doubts,
but does not entirely set me at ease with the bloat.

This set here gives us a compound_head() that is safe to tuck
inside PageFlags ops in that set there: that doesn't worry me
any more.  And the bloat is reduced enough that I don't think
it should be allowed to block Kirill's progress.

But I can't shake off the idea that someone somewhere (0day perf
results? Mel on an __spree?) is going to need to shave away some
of these hidden and rarely needed compound_head() calls one day.

Take __activate_page() in mm/swap.c as an example, something that
begins with a bold PageLRU && !PageActive && !PageUnevictable.
That function contains six sequences of the form
mov 0x20(%rdi),%rax; test $0x1,%al; je over_next; sub $0x1,%rax.

Five of which I expect could be avoided if we just did a
compound_head() conversion on entry.  I suppose any branch
predictor will do a fine job with the last five: am I just too
old-fashioned to be thinking we should (have the ability to)
eliminate them completely?

I'm not saying that we need to convert __activate_page, or anything
else, at this time; but I do think we shall want diet versions of
at least the simple PageFlags tests themselves (we should already
be sparing with the atomic ones), and need to establish convention
now for what the diet versions of PageFlags will be called.

Would __PageFlag be good enough?  Could we say that
__SetPageFlag and __ClearPageFlag omit the compound_head() -
we already have to think carefully when applying those?

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-23 23:59     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 96+ messages in thread
From: Jesper Dangaard Brouer @ 2015-08-23 23:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm, Christoph Lameter

On Wed, 19 Aug 2015 12:21:45 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
[...]

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0735bc0a351a..a4c4b7d07473 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h

[...]
> -/*
> - * If we access compound page synchronously such as access to
> - * allocated page, there is no need to handle tail flag race, so we can
> - * check tail flag directly without any synchronization primitive.
> - */
> -static inline struct page *compound_head_fast(struct page *page)
> -{
> -	if (unlikely(PageTail(page)))
> -		return page->first_page;
> -	return page;
> -}
> -
[...]

> @@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
>  {
>  	struct page *page = virt_to_page(x);
>  
> -	/*
> -	 * We don't need to worry about synchronization of tail flag
> -	 * when we call virt_to_head_page() since it is only called for
> -	 * already allocated page and this page won't be freed until
> -	 * this virt_to_head_page() is finished. So use _fast variant.
> -	 */
> -	return compound_head_fast(page);
> +	return compound_head(page);
>  }

I hope this does not slow down the SLAB/slub allocator?
(which calls virt_to_head_page() frequently)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-23 23:59     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 96+ messages in thread
From: Jesper Dangaard Brouer @ 2015-08-23 23:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm, Christoph Lameter

On Wed, 19 Aug 2015 12:21:45 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
[...]

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0735bc0a351a..a4c4b7d07473 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h

[...]
> -/*
> - * If we access compound page synchronously such as access to
> - * allocated page, there is no need to handle tail flag race, so we can
> - * check tail flag directly without any synchronization primitive.
> - */
> -static inline struct page *compound_head_fast(struct page *page)
> -{
> -	if (unlikely(PageTail(page)))
> -		return page->first_page;
> -	return page;
> -}
> -
[...]

> @@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
>  {
>  	struct page *page = virt_to_page(x);
>  
> -	/*
> -	 * We don't need to worry about synchronization of tail flag
> -	 * when we call virt_to_head_page() since it is only called for
> -	 * already allocated page and this page won't be freed until
> -	 * this virt_to_head_page() is finished. So use _fast variant.
> -	 */
> -	return compound_head_fast(page);
> +	return compound_head(page);
>  }

I hope this does not slow down the SLAB/slub allocator?
(which calls virt_to_head_page() frequently)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-23 23:59     ` Jesper Dangaard Brouer
@ 2015-08-24  9:29       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-24  9:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Mon, Aug 24, 2015 at 01:59:45AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 19 Aug 2015 12:21:45 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> [...]
> 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0735bc0a351a..a4c4b7d07473 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> 
> [...]
> > -/*
> > - * If we access compound page synchronously such as access to
> > - * allocated page, there is no need to handle tail flag race, so we can
> > - * check tail flag directly without any synchronization primitive.
> > - */
> > -static inline struct page *compound_head_fast(struct page *page)
> > -{
> > -	if (unlikely(PageTail(page)))
> > -		return page->first_page;
> > -	return page;
> > -}
> > -
> [...]
> 
> > @@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
> >  {
> >  	struct page *page = virt_to_page(x);
> >  
> > -	/*
> > -	 * We don't need to worry about synchronization of tail flag
> > -	 * when we call virt_to_head_page() since it is only called for
> > -	 * already allocated page and this page won't be freed until
> > -	 * this virt_to_head_page() is finished. So use _fast variant.
> > -	 */
> > -	return compound_head_fast(page);
> > +	return compound_head(page);
> >  }
> 
> I hope this does not slow down the SLAB/slub allocator?
> (which calls virt_to_head_page() frequently)

It should be slightly faster.

Before:

00002e90 <test_virt_to_head_page>:
    2e90:	8b 15 00 00 00 00    	mov    0x0,%edx
    2e96:	05 00 00 00 40       	add    $0x40000000,%eax
    2e9b:	c1 e8 0c             	shr    $0xc,%eax
    2e9e:	c1 e0 05             	shl    $0x5,%eax
    2ea1:	01 d0                	add    %edx,%eax
    2ea3:	8b 10                	mov    (%eax),%edx
    2ea5:	f6 c6 80             	test   $0x80,%dh
    2ea8:	75 06                	jne    2eb0 <test_virt_to_head_page+0x20>
    2eaa:	c3                   	ret    
    2eab:	90                   	nop
    2eac:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
    2eb0:	8b 40 1c             	mov    0x1c(%eax),%eax
    2eb3:	c3                   	ret    

After:

00003070 <test_virt_to_head_page>:
    3070:	8b 15 00 00 00 00    	mov    0x0,%edx
    3076:	05 00 00 00 40       	add    $0x40000000,%eax
    307b:	c1 e8 0c             	shr    $0xc,%eax
    307e:	c1 e0 05             	shl    $0x5,%eax
    3081:	01 d0                	add    %edx,%eax
    3083:	8b 50 14             	mov    0x14(%eax),%edx
    3086:	8d 4a ff             	lea    -0x1(%edx),%ecx
    3089:	f6 c2 01             	test   $0x1,%dl
    308c:	0f 45 c1             	cmovne %ecx,%eax
    308f:	c3                   	ret    

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-24  9:29       ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-24  9:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Mon, Aug 24, 2015 at 01:59:45AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 19 Aug 2015 12:21:45 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> [...]
> 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0735bc0a351a..a4c4b7d07473 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> 
> [...]
> > -/*
> > - * If we access compound page synchronously such as access to
> > - * allocated page, there is no need to handle tail flag race, so we can
> > - * check tail flag directly without any synchronization primitive.
> > - */
> > -static inline struct page *compound_head_fast(struct page *page)
> > -{
> > -	if (unlikely(PageTail(page)))
> > -		return page->first_page;
> > -	return page;
> > -}
> > -
> [...]
> 
> > @@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
> >  {
> >  	struct page *page = virt_to_page(x);
> >  
> > -	/*
> > -	 * We don't need to worry about synchronization of tail flag
> > -	 * when we call virt_to_head_page() since it is only called for
> > -	 * already allocated page and this page won't be freed until
> > -	 * this virt_to_head_page() is finished. So use _fast variant.
> > -	 */
> > -	return compound_head_fast(page);
> > +	return compound_head(page);
> >  }
> 
> I hope this does not slow down the SLAB/slub allocator?
> (which calls virt_to_head_page() frequently)

It should be slightly faster.

Before:

00002e90 <test_virt_to_head_page>:
    2e90:	8b 15 00 00 00 00    	mov    0x0,%edx
    2e96:	05 00 00 00 40       	add    $0x40000000,%eax
    2e9b:	c1 e8 0c             	shr    $0xc,%eax
    2e9e:	c1 e0 05             	shl    $0x5,%eax
    2ea1:	01 d0                	add    %edx,%eax
    2ea3:	8b 10                	mov    (%eax),%edx
    2ea5:	f6 c6 80             	test   $0x80,%dh
    2ea8:	75 06                	jne    2eb0 <test_virt_to_head_page+0x20>
    2eaa:	c3                   	ret    
    2eab:	90                   	nop
    2eac:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
    2eb0:	8b 40 1c             	mov    0x1c(%eax),%eax
    2eb3:	c3                   	ret    

After:

00003070 <test_virt_to_head_page>:
    3070:	8b 15 00 00 00 00    	mov    0x0,%edx
    3076:	05 00 00 00 40       	add    $0x40000000,%eax
    307b:	c1 e8 0c             	shr    $0xc,%eax
    307e:	c1 e0 05             	shl    $0x5,%eax
    3081:	01 d0                	add    %edx,%eax
    3083:	8b 50 14             	mov    0x14(%eax),%edx
    3086:	8d 4a ff             	lea    -0x1(%edx),%ecx
    3089:	f6 c2 01             	test   $0x1,%dl
    308c:	0f 45 c1             	cmovne %ecx,%eax
    308f:	c3                   	ret    

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
  2015-08-22 20:13       ` Hugh Dickins
@ 2015-08-24  9:36         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-24  9:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm

On Sat, Aug 22, 2015 at 01:13:19PM -0700, Hugh Dickins wrote:
> Yes, I did think the compound destructor enum stuff over-engineered,
> and would have preferred just direct calls to free_compound_page()
> or free_huge_page() myself.  But when I tried to make a patch on
> top to do that, even when I left PageHuge out-of-line (which had
> certainly not been my intention), it still generated more kernel
> text than Kirill's enum version (maybe his "- 1" in compound_head
> works better in some places than masking out 3, I didn't study);
> so let's forget about that.

I had my agenda on ->compound_dtor: my refcounting patchset introduces
one more compound destructor. I wanted to avoid hardcoding them here.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 0/5] Fix compound_head() race
@ 2015-08-24  9:36         ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-24  9:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm

On Sat, Aug 22, 2015 at 01:13:19PM -0700, Hugh Dickins wrote:
> Yes, I did think the compound destructor enum stuff over-engineered,
> and would have preferred just direct calls to free_compound_page()
> or free_huge_page() myself.  But when I tried to make a patch on
> top to do that, even when I left PageHuge out-of-line (which had
> certainly not been my intention), it still generated more kernel
> text than Kirill's enum version (maybe his "- 1" in compound_head
> works better in some places than masking out 3, I didn't study);
> so let's forget about that.

I had my agenda on ->compound_dtor: my refcounting patchset introduces
one more compound destructor. I wanted to avoid hardcoding them here.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-24 10:17     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-24 10:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm

On Wed, Aug 19, 2015 at 12:21:45PM +0300, Kirill A. Shutemov wrote:
> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
> 	CPU0					CPU1
> 
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
> 					put_page()
> 					  tail->first_page = NULL
>       head = tail->first_page
> 					alloc_pages(__GFP_COMP)
> 					   prep_compound_page()
> 					     tail->first_page = head
> 					     __SetPageTail(p);
>       !!PageTail() == true
>     <head == NULL dereferencing>
> 
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
> 
> We can fix the race by changing how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
> 
> The patch introduces page->compound_head into third double word block in
> front of compound_dtor and compound_order. That means it shares storage
> space with:
> 
>  - page->lru.next;
>  - page->next;
>  - page->rcu_head.next;
>  - page->pmd_huge_pte;
> 
> That's too long list to be absolutely sure, but looks like nobody uses
> bit 0 of the word. It can be used to encode PageTail(). And if the bit
> set, rest of the word is pointer to head page.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>

If DEFERRED_STRUCT_PAGE_INIT=n, combining this patchset with my page-flags
patches causes oops in SetPageReserved() called from
reserve_bootmem_region().

It happens because we haven't yet initilized the word in struct page and
PageTail() inside SetPageReserved() can give false-positive, which leads
to bogus compound_head() result.

IIUC, we initialize the word only on first allocation of the page. It can
be too late: pfn scanner can see false-positive PageTail() from not yet
allocated pages too.

Here's fixlet for patch to address the issue.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 347724850665..d0e3fca830f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -892,6 +892,8 @@ static void init_reserved_page(unsigned long pfn)
 #else
 static inline void init_reserved_page(unsigned long pfn)
 {
+	/* Avoid false-positive PageTail() */
+	INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-24 10:17     ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-24 10:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, David Rientjes,
	linux-kernel, linux-mm

On Wed, Aug 19, 2015 at 12:21:45PM +0300, Kirill A. Shutemov wrote:
> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
> 	CPU0					CPU1
> 
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
> 					put_page()
> 					  tail->first_page = NULL
>       head = tail->first_page
> 					alloc_pages(__GFP_COMP)
> 					   prep_compound_page()
> 					     tail->first_page = head
> 					     __SetPageTail(p);
>       !!PageTail() == true
>     <head == NULL dereferencing>
> 
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
> 
> We can fix the race by changing how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
> 
> The patch introduces page->compound_head into third double word block in
> front of compound_dtor and compound_order. That means it shares storage
> space with:
> 
>  - page->lru.next;
>  - page->next;
>  - page->rcu_head.next;
>  - page->pmd_huge_pte;
> 
> That's too long list to be absolutely sure, but looks like nobody uses
> bit 0 of the word. It can be used to encode PageTail(). And if the bit
> set, rest of the word is pointer to head page.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>

If DEFERRED_STRUCT_PAGE_INIT=n, combining this patchset with my page-flags
patches causes oops in SetPageReserved() called from
reserve_bootmem_region().

It happens because we haven't yet initilized the word in struct page and
PageTail() inside SetPageReserved() can give false-positive, which leads
to bogus compound_head() result.

IIUC, we initialize the word only on first allocation of the page. It can
be too late: pfn scanner can see false-positive PageTail() from not yet
allocated pages too.

Here's fixlet for patch to address the issue.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 347724850665..d0e3fca830f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -892,6 +892,8 @@ static void init_reserved_page(unsigned long pfn)
 #else
 static inline void init_reserved_page(unsigned long pfn)
 {
+	/* Avoid false-positive PageTail() */
+	INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 1/5] mm: drop page->slab_page
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-24 14:59     ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 14:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Joonsoo Kim, Andi Kleen

On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> uses slab_page field in struct page.
>
> Let's drop it.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andi Kleen <ak@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   include/linux/mm_types.h | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0038ac7466fd..58620ac7f15c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -140,7 +140,6 @@ struct page {
>   #endif
>   		};
>
> -		struct slab *slab_page; /* slab fields */
>   		struct rcu_head rcu_head;	/* Used by SLAB
>   						 * when destroying via RCU
>   						 */
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 1/5] mm: drop page->slab_page
@ 2015-08-24 14:59     ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 14:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Joonsoo Kim, Andi Kleen

On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> uses slab_page field in struct page.
>
> Let's drop it.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andi Kleen <ak@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   include/linux/mm_types.h | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0038ac7466fd..58620ac7f15c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -140,7 +140,6 @@ struct page {
>   #endif
>   		};
>
> -		struct slab *slab_page; /* slab fields */
>   		struct rcu_head rcu_head;	/* Used by SLAB
>   						 * when destroying via RCU
>   						 */
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 1/5] mm: drop page->slab_page
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-24 15:02     ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 15:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Joonsoo Kim, Andi Kleen

On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> uses slab_page field in struct page.
>
> Let's drop it.

Ah, how about dropping this comment in mm/slab.c:slab_destroy() as well?

                 /*
                  * RCU free overloads the RCU head over the LRU.
                  * slab_page has been overloeaded over the LRU,
                  * however it is not used from now on so that
                  * we can use it safely.
                  */


> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> ---
>   include/linux/mm_types.h | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0038ac7466fd..58620ac7f15c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -140,7 +140,6 @@ struct page {
>   #endif
>   		};
>
> -		struct slab *slab_page; /* slab fields */
>   		struct rcu_head rcu_head;	/* Used by SLAB
>   						 * when destroying via RCU
>   						 */
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 1/5] mm: drop page->slab_page
@ 2015-08-24 15:02     ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 15:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Joonsoo Kim, Andi Kleen

On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> uses slab_page field in struct page.
>
> Let's drop it.

Ah, how about dropping this comment in mm/slab.c:slab_destroy() as well?

                 /*
                  * RCU free overloads the RCU head over the LRU.
                  * slab_page has been overloeaded over the LRU,
                  * however it is not used from now on so that
                  * we can use it safely.
                  */


> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> ---
>   include/linux/mm_types.h | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0038ac7466fd..58620ac7f15c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -140,7 +140,6 @@ struct page {
>   #endif
>   		};
>
> -		struct slab *slab_page; /* slab fields */
>   		struct rcu_head rcu_head;	/* Used by SLAB
>   						 * when destroying via RCU
>   						 */
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 2/5] zsmalloc: use page->private instead of page->first_page
  2015-08-19  9:21   ` Kirill A. Shutemov
@ 2015-08-24 15:04     ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 15:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm

On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> We are going to rework how compound_head() work. It will not use
> page->first_page as we have it now.
>
> The only other user of page->fisrt_page beyond compound pages is

                                ^ typo

> zsmalloc.
>
> Let's use page->private instead of page->first_page here. It occupies
> the same storage space.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/zsmalloc.c | 11 +++++------
>   1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0a7f81aa2249..a85754e69879 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -16,7 +16,7 @@
>    * struct page(s) to form a zspage.
>    *
>    * Usage of struct page fields:
> - *	page->first_page: points to the first component (0-order) page
> + *	page->private: points to the first component (0-order) page
>    *	page->index (union with page->freelist): offset of the first object
>    *		starting in this page. For the first page, this is
>    *		always 0, so we use this field (aka freelist) to point
> @@ -26,8 +26,7 @@
>    *
>    *	For _first_ page only:
>    *
> - *	page->private (union with page->first_page): refers to the
> - *		component page after the first page
> + *	page->private: refers to the component page after the first page
>    *		If the page is first_page for huge object, it stores handle.
>    *		Look at size_class->huge.
>    *	page->freelist: points to the first free object in zspage.
> @@ -770,7 +769,7 @@ static struct page *get_first_page(struct page *page)
>   	if (is_first_page(page))
>   		return page;
>   	else
> -		return page->first_page;
> +		return (struct page *)page_private(page);
>   }
>
>   static struct page *get_next_page(struct page *page)
> @@ -955,7 +954,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   	 * Allocate individual pages and link them together as:
>   	 * 1. first page->private = first sub-page
>   	 * 2. all sub-pages are linked together using page->lru
> -	 * 3. each sub-page is linked to the first page using page->first_page
> +	 * 3. each sub-page is linked to the first page using page->private
>   	 *
>   	 * For each size class, First/Head pages are linked together using
>   	 * page->lru. Also, we set PG_private to identify the first page
> @@ -980,7 +979,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   		if (i == 1)
>   			set_page_private(first_page, (unsigned long)page);
>   		if (i >= 1)
> -			page->first_page = first_page;
> +			set_page_private(first_page, (unsigned long)first_page);
>   		if (i >= 2)
>   			list_add(&page->lru, &prev_page->lru);
>   		if (i == class->pages_per_zspage - 1)	/* last page */
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 2/5] zsmalloc: use page->private instead of page->first_page
@ 2015-08-24 15:04     ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 15:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm

On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> We are going to rework how compound_head() work. It will not use
> page->first_page as we have it now.
>
> The only other user of page->fisrt_page beyond compound pages is

                                ^ typo

> zsmalloc.
>
> Let's use page->private instead of page->first_page here. It occupies
> the same storage space.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/zsmalloc.c | 11 +++++------
>   1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0a7f81aa2249..a85754e69879 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -16,7 +16,7 @@
>    * struct page(s) to form a zspage.
>    *
>    * Usage of struct page fields:
> - *	page->first_page: points to the first component (0-order) page
> + *	page->private: points to the first component (0-order) page
>    *	page->index (union with page->freelist): offset of the first object
>    *		starting in this page. For the first page, this is
>    *		always 0, so we use this field (aka freelist) to point
> @@ -26,8 +26,7 @@
>    *
>    *	For _first_ page only:
>    *
> - *	page->private (union with page->first_page): refers to the
> - *		component page after the first page
> + *	page->private: refers to the component page after the first page
>    *		If the page is first_page for huge object, it stores handle.
>    *		Look at size_class->huge.
>    *	page->freelist: points to the first free object in zspage.
> @@ -770,7 +769,7 @@ static struct page *get_first_page(struct page *page)
>   	if (is_first_page(page))
>   		return page;
>   	else
> -		return page->first_page;
> +		return (struct page *)page_private(page);
>   }
>
>   static struct page *get_next_page(struct page *page)
> @@ -955,7 +954,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   	 * Allocate individual pages and link them together as:
>   	 * 1. first page->private = first sub-page
>   	 * 2. all sub-pages are linked together using page->lru
> -	 * 3. each sub-page is linked to the first page using page->first_page
> +	 * 3. each sub-page is linked to the first page using page->private
>   	 *
>   	 * For each size class, First/Head pages are linked together using
>   	 * page->lru. Also, we set PG_private to identify the first page
> @@ -980,7 +979,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   		if (i == 1)
>   			set_page_private(first_page, (unsigned long)page);
>   		if (i >= 1)
> -			page->first_page = first_page;
> +			set_page_private(first_page, (unsigned long)first_page);
>   		if (i >= 2)
>   			list_add(&page->lru, &prev_page->lru);
>   		if (i == class->pages_per_zspage - 1)	/* last page */
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-21 19:34             ` Andrew Morton
@ 2015-08-24 15:49               ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 15:49 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Christoph Lameter, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm

On 08/21/2015 09:34 PM, Andrew Morton wrote:
> On Fri, 21 Aug 2015 22:31:09 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>
>> On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
>>> On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
>>>
>>>>> Is this really true?  For example if it's a slab page, will that page
>>>>> ever be inspected by code which is looking for the PageTail bit?
>>>>
>>>> +Christoph.
>>>>
>>>> What we know for sure is that space is not used in tail pages, otherwise
>>>> it would collide with current compound_dtor.
>>>
>>> Sl*b allocators only do a virt_to_head_page on tail pages.
>>
>> The question was whether it's safe to assume that the bit 0 is always zero
>> in the word as this bit will encode PageTail().
>
> That wasn't my question actually...
>
> What I'm wondering is: if this page is being used for slab, will any
> code path ever run PageTail() against it?  If not, we don't need to be
> concerned about that bit.

Pfn scanners such as compaction might inspect such pages and run 
compound_head() (and thus PageTail) on them. I think no kind of page 
within a zone (slab or otherwise) is "protected" from this, which is why 
it needs to be robust.

> And slab was just the example I chose.  The same question petains to
> all other uses of that union.
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-24 15:49               ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-24 15:49 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Christoph Lameter, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm

On 08/21/2015 09:34 PM, Andrew Morton wrote:
> On Fri, 21 Aug 2015 22:31:09 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>
>> On Fri, Aug 21, 2015 at 11:11:27AM -0500, Christoph Lameter wrote:
>>> On Fri, 21 Aug 2015, Kirill A. Shutemov wrote:
>>>
>>>>> Is this really true?  For example if it's a slab page, will that page
>>>>> ever be inspected by code which is looking for the PageTail bit?
>>>>
>>>> +Christoph.
>>>>
>>>> What we know for sure is that space is not used in tail pages, otherwise
>>>> it would collide with current compound_dtor.
>>>
>>> Sl*b allocators only do a virt_to_head_page on tail pages.
>>
>> The question was whether it's safe to assume that the bit 0 is always zero
>> in the word as this bit will encode PageTail().
>
> That wasn't my question actually...
>
> What I'm wondering is: if this page is being used for slab, will any
> code path ever run PageTail() against it?  If not, we don't need to be
> concerned about that bit.

Pfn scanners such as compaction might inspect such pages and run 
compound_head() (and thus PageTail) on them. I think no kind of page 
within a zone (slab or otherwise) is "protected" from this, which is why 
it needs to be robust.

> And slab was just the example I chose.  The same question petains to
> all other uses of that union.
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-21 12:10       ` Kirill A. Shutemov
@ 2015-08-25 11:44         ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-25 11:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>>
>>> The patch introduces page->compound_head into third double word block in
>>> front of compound_dtor and compound_order. That means it shares storage
>>> space with:
>>>
>>>   - page->lru.next;
>>>   - page->next;
>>>   - page->rcu_head.next;
>>>   - page->pmd_huge_pte;
>>>

We should probably ask Paul about the chances that rcu_head.next would 
like to use the bit too one day?
For pgtable_t I can't think of anything better than a warning in the 
generic definition in include/asm-generic/page.h and hope that anyone 
reimplementing it for a new arch will look there first.
The lru part is probably the hardest to prevent danger. It can be used 
for any private purposes. Hopefully everyone currently uses only 
standard list operations here, and the list poison values don't set bit 
0. But I see there can be some arbitrary CONFIG_ILLEGAL_POINTER_VALUE 
added to the poisons, so maybe that's worth some build error check? 
Anyway we would be imposing restrictions on types that are not ours, so 
there might be some resistance...

>
>> Anyway, this is quite subtle and there's a risk that people will
>> accidentally break it later on.  I don't think the patch puts
>> sufficient documentation in place to prevent this.
>
> I would appreciate for suggestion on place and form of documentation.
>
>> And even documentation might not be enough to prevent accidents.
>
> The only think I can propose is VM_BUG_ON() in PageTail() and
> compound_head() which would ensure that page->compound_page points to
> place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.

That should probably catch some bad stuff, but probably only moments 
before it would crash anyway if the pointer was bogus. But I also don't 
see better way, because we can't proactively put checks in those who 
would "misbehave", as we don't know who they are. Putting more debug 
checks in e.g. page freeing might help, but probably not much.

> Do you consider this helpful?
>
>>>
>>> ...
>>>
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -120,7 +120,12 @@ struct page {
>>>   		};
>>>   	};
>>>
>>> -	/* Third double word block */
>>> +	/*
>>> +	 * Third double word block
>>> +	 *
>>> +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
>>> +	 * for non-tail pages.
>>> +	 */
>>>   	union {
>>>   		struct list_head lru;	/* Pageout list, eg. active_list
>>>   					 * protected by zone->lru_lock !
>>> @@ -143,6 +148,7 @@ struct page {
>>>   						 */
>>>   		/* First tail page of compound page */

Note that compound_head is not just in the *first* tail page. Only the 
rest is.

>>>   		struct {
>>> +			unsigned long compound_head; /* If bit zero is set */
>>
>> I think the comments around here should have more details and should
>> be louder!
>
> I'm always bad when it comes to documentation. Is it enough?
>
> 	/*
> 	 * Third double word block
> 	 *
> 	 * WARNING: bit 0 of the first word encode PageTail(). That means
> 	 * the rest users of the storage space MUST NOT use the bit to
> 	 * avoid collision and false-positive PageTail().
> 	 */
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-25 11:44         ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-25 11:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>>
>>> The patch introduces page->compound_head into third double word block in
>>> front of compound_dtor and compound_order. That means it shares storage
>>> space with:
>>>
>>>   - page->lru.next;
>>>   - page->next;
>>>   - page->rcu_head.next;
>>>   - page->pmd_huge_pte;
>>>

We should probably ask Paul about the chances that rcu_head.next would 
like to use the bit too one day?
For pgtable_t I can't think of anything better than a warning in the 
generic definition in include/asm-generic/page.h and hope that anyone 
reimplementing it for a new arch will look there first.
The lru part is probably the hardest to prevent danger. It can be used 
for any private purposes. Hopefully everyone currently uses only 
standard list operations here, and the list poison values don't set bit 
0. But I see there can be some arbitrary CONFIG_ILLEGAL_POINTER_VALUE 
added to the poisons, so maybe that's worth some build error check? 
Anyway we would be imposing restrictions on types that are not ours, so 
there might be some resistance...

>
>> Anyway, this is quite subtle and there's a risk that people will
>> accidentally break it later on.  I don't think the patch puts
>> sufficient documentation in place to prevent this.
>
> I would appreciate for suggestion on place and form of documentation.
>
>> And even documentation might not be enough to prevent accidents.
>
> The only think I can propose is VM_BUG_ON() in PageTail() and
> compound_head() which would ensure that page->compound_page points to
> place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.

That should probably catch some bad stuff, but probably only moments 
before it would crash anyway if the pointer was bogus. But I also don't 
see better way, because we can't proactively put checks in those who 
would "misbehave", as we don't know who they are. Putting more debug 
checks in e.g. page freeing might help, but probably not much.

> Do you consider this helpful?
>
>>>
>>> ...
>>>
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -120,7 +120,12 @@ struct page {
>>>   		};
>>>   	};
>>>
>>> -	/* Third double word block */
>>> +	/*
>>> +	 * Third double word block
>>> +	 *
>>> +	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
>>> +	 * for non-tail pages.
>>> +	 */
>>>   	union {
>>>   		struct list_head lru;	/* Pageout list, eg. active_list
>>>   					 * protected by zone->lru_lock !
>>> @@ -143,6 +148,7 @@ struct page {
>>>   						 */
>>>   		/* First tail page of compound page */

Note that compound_head is not just in the *first* tail page. Only the 
rest is.

>>>   		struct {
>>> +			unsigned long compound_head; /* If bit zero is set */
>>
>> I think the comments around here should have more details and should
>> be louder!
>
> I'm always bad when it comes to documentation. Is it enough?
>
> 	/*
> 	 * Third double word block
> 	 *
> 	 * WARNING: bit 0 of the first word encode PageTail(). That means
> 	 * the rest users of the storage space MUST NOT use the bit to
> 	 * avoid collision and false-positive PageTail().
> 	 */
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 1/5] mm: drop page->slab_page
  2015-08-24 15:02     ` Vlastimil Babka
@ 2015-08-25 17:24       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-25 17:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Joonsoo Kim, Andi Kleen

On Mon, Aug 24, 2015 at 05:02:16PM +0200, Vlastimil Babka wrote:
> On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> >Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> >uses slab_page field in struct page.
> >
> >Let's drop it.
> 
> Ah, how about dropping this comment in mm/slab.c:slab_destroy() as well?
> 
>                 /*
>                  * RCU free overloads the RCU head over the LRU.
>                  * slab_page has been overloeaded over the LRU,
>                  * however it is not used from now on so that
>                  * we can use it safely.
>                  */

Actually, whole block can be replaced by

	call_rcu(&page->rcu_head, kmem_rcu_free);

Thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 1/5] mm: drop page->slab_page
@ 2015-08-25 17:24       ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-25 17:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Joonsoo Kim, Andi Kleen

On Mon, Aug 24, 2015 at 05:02:16PM +0200, Vlastimil Babka wrote:
> On 08/19/2015 11:21 AM, Kirill A. Shutemov wrote:
> >Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> >uses slab_page field in struct page.
> >
> >Let's drop it.
> 
> Ah, how about dropping this comment in mm/slab.c:slab_destroy() as well?
> 
>                 /*
>                  * RCU free overloads the RCU head over the LRU.
>                  * slab_page has been overloeaded over the LRU,
>                  * however it is not used from now on so that
>                  * we can use it safely.
>                  */

Actually, whole block can be replaced by

	call_rcu(&page->rcu_head, kmem_rcu_free);

Thanks.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-25 11:44         ` Vlastimil Babka
@ 2015-08-25 18:33           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-25 18:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter,
	Paul E. McKenney

On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> >On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> >>On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >>
> >>>The patch introduces page->compound_head into third double word block in
> >>>front of compound_dtor and compound_order. That means it shares storage
> >>>space with:
> >>>
> >>>  - page->lru.next;
> >>>  - page->next;
> >>>  - page->rcu_head.next;
> >>>  - page->pmd_huge_pte;
> >>>
> 
> We should probably ask Paul about the chances that rcu_head.next would like
> to use the bit too one day?

+Paul.

> For pgtable_t I can't think of anything better than a warning in the generic
> definition in include/asm-generic/page.h and hope that anyone reimplementing
> it for a new arch will look there first.

I will move it to other word, just in case.

> The lru part is probably the hardest to prevent danger. It can be used for
> any private purposes. Hopefully everyone currently uses only standard list
> operations here, and the list poison values don't set bit 0. But I see there
> can be some arbitrary CONFIG_ILLEGAL_POINTER_VALUE added to the poisons, so
> maybe that's worth some build error check? Anyway we would be imposing
> restrictions on types that are not ours, so there might be some
> resistance...

I will add BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1); 

> >>Anyway, this is quite subtle and there's a risk that people will
> >>accidentally break it later on.  I don't think the patch puts
> >>sufficient documentation in place to prevent this.
> >
> >I would appreciate for suggestion on place and form of documentation.
> >
> >>And even documentation might not be enough to prevent accidents.
> >
> >The only think I can propose is VM_BUG_ON() in PageTail() and
> >compound_head() which would ensure that page->compound_page points to
> >place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.
> 
> That should probably catch some bad stuff, but probably only moments before
> it would crash anyway if the pointer was bogus. But I also don't see better
> way, because we can't proactively put checks in those who would "misbehave",
> as we don't know who they are. Putting more debug checks in e.g. page
> freeing might help, but probably not much.

So, do you think it worth it or not after all?
> 
> >Do you consider this helpful?
> >
> >>>
> >>>...
> >>>
> >>>--- a/include/linux/mm_types.h
> >>>+++ b/include/linux/mm_types.h
> >>>@@ -120,7 +120,12 @@ struct page {
> >>>  		};
> >>>  	};
> >>>
> >>>-	/* Third double word block */
> >>>+	/*
> >>>+	 * Third double word block
> >>>+	 *
> >>>+	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> >>>+	 * for non-tail pages.
> >>>+	 */
> >>>  	union {
> >>>  		struct list_head lru;	/* Pageout list, eg. active_list
> >>>  					 * protected by zone->lru_lock !
> >>>@@ -143,6 +148,7 @@ struct page {
> >>>  						 */
> >>>  		/* First tail page of compound page */
> 
> Note that compound_head is not just in the *first* tail page. Only the rest
> is.

Right.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-25 18:33           ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-25 18:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter,
	Paul E. McKenney

On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> >On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> >>On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >>
> >>>The patch introduces page->compound_head into third double word block in
> >>>front of compound_dtor and compound_order. That means it shares storage
> >>>space with:
> >>>
> >>>  - page->lru.next;
> >>>  - page->next;
> >>>  - page->rcu_head.next;
> >>>  - page->pmd_huge_pte;
> >>>
> 
> We should probably ask Paul about the chances that rcu_head.next would like
> to use the bit too one day?

+Paul.

> For pgtable_t I can't think of anything better than a warning in the generic
> definition in include/asm-generic/page.h and hope that anyone reimplementing
> it for a new arch will look there first.

I will move it to other word, just in case.

> The lru part is probably the hardest to prevent danger. It can be used for
> any private purposes. Hopefully everyone currently uses only standard list
> operations here, and the list poison values don't set bit 0. But I see there
> can be some arbitrary CONFIG_ILLEGAL_POINTER_VALUE added to the poisons, so
> maybe that's worth some build error check? Anyway we would be imposing
> restrictions on types that are not ours, so there might be some
> resistance...

I will add BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1); 

> >>Anyway, this is quite subtle and there's a risk that people will
> >>accidentally break it later on.  I don't think the patch puts
> >>sufficient documentation in place to prevent this.
> >
> >I would appreciate for suggestion on place and form of documentation.
> >
> >>And even documentation might not be enough to prevent accidents.
> >
> >The only think I can propose is VM_BUG_ON() in PageTail() and
> >compound_head() which would ensure that page->compound_page points to
> >place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.
> 
> That should probably catch some bad stuff, but probably only moments before
> it would crash anyway if the pointer was bogus. But I also don't see better
> way, because we can't proactively put checks in those who would "misbehave",
> as we don't know who they are. Putting more debug checks in e.g. page
> freeing might help, but probably not much.

So, do you think it worth it or not after all?
> 
> >Do you consider this helpful?
> >
> >>>
> >>>...
> >>>
> >>>--- a/include/linux/mm_types.h
> >>>+++ b/include/linux/mm_types.h
> >>>@@ -120,7 +120,12 @@ struct page {
> >>>  		};
> >>>  	};
> >>>
> >>>-	/* Third double word block */
> >>>+	/*
> >>>+	 * Third double word block
> >>>+	 *
> >>>+	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> >>>+	 * for non-tail pages.
> >>>+	 */
> >>>  	union {
> >>>  		struct list_head lru;	/* Pageout list, eg. active_list
> >>>  					 * protected by zone->lru_lock !
> >>>@@ -143,6 +148,7 @@ struct page {
> >>>  						 */
> >>>  		/* First tail page of compound page */
> 
> Note that compound_head is not just in the *first* tail page. Only the rest
> is.

Right.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-25 18:33           ` Kirill A. Shutemov
@ 2015-08-25 20:11             ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-25 20:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > >On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > >>On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >>
> > >>>The patch introduces page->compound_head into third double word block in
> > >>>front of compound_dtor and compound_order. That means it shares storage
> > >>>space with:
> > >>>
> > >>>  - page->lru.next;
> > >>>  - page->next;
> > >>>  - page->rcu_head.next;
> > >>>  - page->pmd_huge_pte;
> > >>>
> > 
> > We should probably ask Paul about the chances that rcu_head.next would like
> > to use the bit too one day?
> 
> +Paul.

The call_rcu() function does stomp that bit, but if you stop using that
bit before you invoke call_rcu(), no problem.

								Thanx, Paul

> > For pgtable_t I can't think of anything better than a warning in the generic
> > definition in include/asm-generic/page.h and hope that anyone reimplementing
> > it for a new arch will look there first.
> 
> I will move it to other word, just in case.
> 
> > The lru part is probably the hardest to prevent danger. It can be used for
> > any private purposes. Hopefully everyone currently uses only standard list
> > operations here, and the list poison values don't set bit 0. But I see there
> > can be some arbitrary CONFIG_ILLEGAL_POINTER_VALUE added to the poisons, so
> > maybe that's worth some build error check? Anyway we would be imposing
> > restrictions on types that are not ours, so there might be some
> > resistance...
> 
> I will add BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1); 
> 
> > >>Anyway, this is quite subtle and there's a risk that people will
> > >>accidentally break it later on.  I don't think the patch puts
> > >>sufficient documentation in place to prevent this.
> > >
> > >I would appreciate for suggestion on place and form of documentation.
> > >
> > >>And even documentation might not be enough to prevent accidents.
> > >
> > >The only think I can propose is VM_BUG_ON() in PageTail() and
> > >compound_head() which would ensure that page->compound_page points to
> > >place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.
> > 
> > That should probably catch some bad stuff, but probably only moments before
> > it would crash anyway if the pointer was bogus. But I also don't see better
> > way, because we can't proactively put checks in those who would "misbehave",
> > as we don't know who they are. Putting more debug checks in e.g. page
> > freeing might help, but probably not much.
> 
> So, do you think it worth it or not after all?
> > 
> > >Do you consider this helpful?
> > >
> > >>>
> > >>>...
> > >>>
> > >>>--- a/include/linux/mm_types.h
> > >>>+++ b/include/linux/mm_types.h
> > >>>@@ -120,7 +120,12 @@ struct page {
> > >>>  		};
> > >>>  	};
> > >>>
> > >>>-	/* Third double word block */
> > >>>+	/*
> > >>>+	 * Third double word block
> > >>>+	 *
> > >>>+	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> > >>>+	 * for non-tail pages.
> > >>>+	 */
> > >>>  	union {
> > >>>  		struct list_head lru;	/* Pageout list, eg. active_list
> > >>>  					 * protected by zone->lru_lock !
> > >>>@@ -143,6 +148,7 @@ struct page {
> > >>>  						 */
> > >>>  		/* First tail page of compound page */
> > 
> > Note that compound_head is not just in the *first* tail page. Only the rest
> > is.
> 
> Right.
> 
> -- 
>  Kirill A. Shutemov
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-25 20:11             ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-25 20:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > >On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > >>On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >>
> > >>>The patch introduces page->compound_head into third double word block in
> > >>>front of compound_dtor and compound_order. That means it shares storage
> > >>>space with:
> > >>>
> > >>>  - page->lru.next;
> > >>>  - page->next;
> > >>>  - page->rcu_head.next;
> > >>>  - page->pmd_huge_pte;
> > >>>
> > 
> > We should probably ask Paul about the chances that rcu_head.next would like
> > to use the bit too one day?
> 
> +Paul.

The call_rcu() function does stomp that bit, but if you stop using that
bit before you invoke call_rcu(), no problem.

								Thanx, Paul

> > For pgtable_t I can't think of anything better than a warning in the generic
> > definition in include/asm-generic/page.h and hope that anyone reimplementing
> > it for a new arch will look there first.
> 
> I will move it to other word, just in case.
> 
> > The lru part is probably the hardest to prevent danger. It can be used for
> > any private purposes. Hopefully everyone currently uses only standard list
> > operations here, and the list poison values don't set bit 0. But I see there
> > can be some arbitrary CONFIG_ILLEGAL_POINTER_VALUE added to the poisons, so
> > maybe that's worth some build error check? Anyway we would be imposing
> > restrictions on types that are not ours, so there might be some
> > resistance...
> 
> I will add BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1); 
> 
> > >>Anyway, this is quite subtle and there's a risk that people will
> > >>accidentally break it later on.  I don't think the patch puts
> > >>sufficient documentation in place to prevent this.
> > >
> > >I would appreciate for suggestion on place and form of documentation.
> > >
> > >>And even documentation might not be enough to prevent accidents.
> > >
> > >The only think I can propose is VM_BUG_ON() in PageTail() and
> > >compound_head() which would ensure that page->compound_page points to
> > >place within MAX_ORDER_NR_PAGES before the current page if bit 0 is set.
> > 
> > That should probably catch some bad stuff, but probably only moments before
> > it would crash anyway if the pointer was bogus. But I also don't see better
> > way, because we can't proactively put checks in those who would "misbehave",
> > as we don't know who they are. Putting more debug checks in e.g. page
> > freeing might help, but probably not much.
> 
> So, do you think it worth it or not after all?
> > 
> > >Do you consider this helpful?
> > >
> > >>>
> > >>>...
> > >>>
> > >>>--- a/include/linux/mm_types.h
> > >>>+++ b/include/linux/mm_types.h
> > >>>@@ -120,7 +120,12 @@ struct page {
> > >>>  		};
> > >>>  	};
> > >>>
> > >>>-	/* Third double word block */
> > >>>+	/*
> > >>>+	 * Third double word block
> > >>>+	 *
> > >>>+	 * WARNING: bit 0 of the first word encode PageTail and *must* be 0
> > >>>+	 * for non-tail pages.
> > >>>+	 */
> > >>>  	union {
> > >>>  		struct list_head lru;	/* Pageout list, eg. active_list
> > >>>  					 * protected by zone->lru_lock !
> > >>>@@ -143,6 +148,7 @@ struct page {
> > >>>  						 */
> > >>>  		/* First tail page of compound page */
> > 
> > Note that compound_head is not just in the *first* tail page. Only the rest
> > is.
> 
> Right.
> 
> -- 
>  Kirill A. Shutemov
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-25 20:11             ` Paul E. McKenney
@ 2015-08-25 20:46               ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-25 20:46 UTC (permalink / raw)
  To: paulmck, Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On 25.8.2015 22:11, Paul E. McKenney wrote:
> On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
>>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
>>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>>>>>
>>>>>> The patch introduces page->compound_head into third double word block in
>>>>>> front of compound_dtor and compound_order. That means it shares storage
>>>>>> space with:
>>>>>>
>>>>>>  - page->lru.next;
>>>>>>  - page->next;
>>>>>>  - page->rcu_head.next;
>>>>>>  - page->pmd_huge_pte;
>>>>>>
>>>
>>> We should probably ask Paul about the chances that rcu_head.next would like
>>> to use the bit too one day?
>>
>> +Paul.
> 
> The call_rcu() function does stomp that bit, but if you stop using that
> bit before you invoke call_rcu(), no problem.

You mean that it sets the bit 0 of rcu_head.next during its processing? That's
bad news then. It's not that we would trigger that bit when the rcu_head part of
the union is "active". It's that pfn scanners could inspect such page at
arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
tail page of a compound page, and interpret the rest of the pointer as a pointer
to the head page (to test it for flags etc).



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-25 20:46               ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-25 20:46 UTC (permalink / raw)
  To: paulmck, Kirill A. Shutemov
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On 25.8.2015 22:11, Paul E. McKenney wrote:
> On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
>>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
>>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>>>>>
>>>>>> The patch introduces page->compound_head into third double word block in
>>>>>> front of compound_dtor and compound_order. That means it shares storage
>>>>>> space with:
>>>>>>
>>>>>>  - page->lru.next;
>>>>>>  - page->next;
>>>>>>  - page->rcu_head.next;
>>>>>>  - page->pmd_huge_pte;
>>>>>>
>>>
>>> We should probably ask Paul about the chances that rcu_head.next would like
>>> to use the bit too one day?
>>
>> +Paul.
> 
> The call_rcu() function does stomp that bit, but if you stop using that
> bit before you invoke call_rcu(), no problem.

You mean that it sets the bit 0 of rcu_head.next during its processing? That's
bad news then. It's not that we would trigger that bit when the rcu_head part of
the union is "active". It's that pfn scanners could inspect such page at
arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
tail page of a compound page, and interpret the rest of the pointer as a pointer
to the head page (to test it for flags etc).


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-25 20:46               ` Vlastimil Babka
@ 2015-08-25 21:19                 ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-25 21:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Kirill A. Shutemov,
	Hugh Dickins, Andrea Arcangeli, Dave Hansen, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> On 25.8.2015 22:11, Paul E. McKenney wrote:
> > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >>>>>
> >>>>>> The patch introduces page->compound_head into third double word block in
> >>>>>> front of compound_dtor and compound_order. That means it shares storage
> >>>>>> space with:
> >>>>>>
> >>>>>>  - page->lru.next;
> >>>>>>  - page->next;
> >>>>>>  - page->rcu_head.next;
> >>>>>>  - page->pmd_huge_pte;
> >>>>>>
> >>>
> >>> We should probably ask Paul about the chances that rcu_head.next would like
> >>> to use the bit too one day?
> >>
> >> +Paul.
> > 
> > The call_rcu() function does stomp that bit, but if you stop using that
> > bit before you invoke call_rcu(), no problem.
> 
> You mean that it sets the bit 0 of rcu_head.next during its processing?

Not at the moment, though RCU will splat if given a misaligned rcu_head
structure because of the possibility to use that bit to flag callbacks
that do nothing but free memory.  If RCU needs to do that (e.g., to
promote energy efficiency), then that bit might well be set during
RCU grace-period processing.

>                                                                         That's
> bad news then. It's not that we would trigger that bit when the rcu_head part of
> the union is "active". It's that pfn scanners could inspect such page at
> arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
> tail page of a compound page, and interpret the rest of the pointer as a pointer
> to the head page (to test it for flags etc).

On the other hand, if you avoid scanning rcu_head structures for pages
that are currently waiting for a grace period, no problem.  RCU does
not use the rcu_head structure at all except for during the time between
when call_rcu() is invoked on that rcu_head structure and the time that
the callback is invoked.

Is there some other page state that indicates that the page is waiting
for a grace period?  If so, you could simply avoid testing that bit in
that case.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-25 21:19                 ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-25 21:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Kirill A. Shutemov,
	Hugh Dickins, Andrea Arcangeli, Dave Hansen, Johannes Weiner,
	Michal Hocko, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> On 25.8.2015 22:11, Paul E. McKenney wrote:
> > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >>>>>
> >>>>>> The patch introduces page->compound_head into third double word block in
> >>>>>> front of compound_dtor and compound_order. That means it shares storage
> >>>>>> space with:
> >>>>>>
> >>>>>>  - page->lru.next;
> >>>>>>  - page->next;
> >>>>>>  - page->rcu_head.next;
> >>>>>>  - page->pmd_huge_pte;
> >>>>>>
> >>>
> >>> We should probably ask Paul about the chances that rcu_head.next would like
> >>> to use the bit too one day?
> >>
> >> +Paul.
> > 
> > The call_rcu() function does stomp that bit, but if you stop using that
> > bit before you invoke call_rcu(), no problem.
> 
> You mean that it sets the bit 0 of rcu_head.next during its processing?

Not at the moment, though RCU will splat if given a misaligned rcu_head
structure because of the possibility to use that bit to flag callbacks
that do nothing but free memory.  If RCU needs to do that (e.g., to
promote energy efficiency), then that bit might well be set during
RCU grace-period processing.

>                                                                         That's
> bad news then. It's not that we would trigger that bit when the rcu_head part of
> the union is "active". It's that pfn scanners could inspect such page at
> arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
> tail page of a compound page, and interpret the rest of the pointer as a pointer
> to the head page (to test it for flags etc).

On the other hand, if you avoid scanning rcu_head structures for pages
that are currently waiting for a grace period, no problem.  RCU does
not use the rcu_head structure at all except for during the time between
when call_rcu() is invoked on that rcu_head structure and the time that
the callback is invoked.

Is there some other page state that indicates that the page is waiting
for a grace period?  If so, you could simply avoid testing that bit in
that case.

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-25 21:19                 ` Paul E. McKenney
@ 2015-08-26 15:04                   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-26 15:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On Tue, Aug 25, 2015 at 02:19:54PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >>>>>
> > >>>>>> The patch introduces page->compound_head into third double word block in
> > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > >>>>>> space with:
> > >>>>>>
> > >>>>>>  - page->lru.next;
> > >>>>>>  - page->next;
> > >>>>>>  - page->rcu_head.next;
> > >>>>>>  - page->pmd_huge_pte;
> > >>>>>>
> > >>>
> > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > >>> to use the bit too one day?
> > >>
> > >> +Paul.
> > > 
> > > The call_rcu() function does stomp that bit, but if you stop using that
> > > bit before you invoke call_rcu(), no problem.
> > 
> > You mean that it sets the bit 0 of rcu_head.next during its processing?
> 
> Not at the moment, though RCU will splat if given a misaligned rcu_head
> structure because of the possibility to use that bit to flag callbacks
> that do nothing but free memory.  If RCU needs to do that (e.g., to
> promote energy efficiency), then that bit might well be set during
> RCU grace-period processing.

Ugh.. :-/

> >                                                                         That's
> > bad news then. It's not that we would trigger that bit when the rcu_head part of
> > the union is "active". It's that pfn scanners could inspect such page at
> > arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
> > tail page of a compound page, and interpret the rest of the pointer as a pointer
> > to the head page (to test it for flags etc).
> 
> On the other hand, if you avoid scanning rcu_head structures for pages
> that are currently waiting for a grace period, no problem.  RCU does
> not use the rcu_head structure at all except for during the time between
> when call_rcu() is invoked on that rcu_head structure and the time that
> the callback is invoked.
> 
> Is there some other page state that indicates that the page is waiting
> for a grace period?  If so, you could simply avoid testing that bit in
> that case.

No, I don't think so.

For compound pages most of info of its state is stored in head page (e.g.
page_count(), flags, etc). So if we examine random page (pfn scanner case)
the very first thing we want to know if we stepped on tail page.
PageTail() is what I wanted to encode in the bit...

What if we change order of fields within rcu_head and put ->func first?
Can we expect this pointer to have bit 0 always clear?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 15:04                   ` Kirill A. Shutemov
  0 siblings, 0 replies; 96+ messages in thread
From: Kirill A. Shutemov @ 2015-08-26 15:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On Tue, Aug 25, 2015 at 02:19:54PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >>>>>
> > >>>>>> The patch introduces page->compound_head into third double word block in
> > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > >>>>>> space with:
> > >>>>>>
> > >>>>>>  - page->lru.next;
> > >>>>>>  - page->next;
> > >>>>>>  - page->rcu_head.next;
> > >>>>>>  - page->pmd_huge_pte;
> > >>>>>>
> > >>>
> > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > >>> to use the bit too one day?
> > >>
> > >> +Paul.
> > > 
> > > The call_rcu() function does stomp that bit, but if you stop using that
> > > bit before you invoke call_rcu(), no problem.
> > 
> > You mean that it sets the bit 0 of rcu_head.next during its processing?
> 
> Not at the moment, though RCU will splat if given a misaligned rcu_head
> structure because of the possibility to use that bit to flag callbacks
> that do nothing but free memory.  If RCU needs to do that (e.g., to
> promote energy efficiency), then that bit might well be set during
> RCU grace-period processing.

Ugh.. :-/

> >                                                                         That's
> > bad news then. It's not that we would trigger that bit when the rcu_head part of
> > the union is "active". It's that pfn scanners could inspect such page at
> > arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
> > tail page of a compound page, and interpret the rest of the pointer as a pointer
> > to the head page (to test it for flags etc).
> 
> On the other hand, if you avoid scanning rcu_head structures for pages
> that are currently waiting for a grace period, no problem.  RCU does
> not use the rcu_head structure at all except for during the time between
> when call_rcu() is invoked on that rcu_head structure and the time that
> the callback is invoked.
> 
> Is there some other page state that indicates that the page is waiting
> for a grace period?  If so, you could simply avoid testing that bit in
> that case.

No, I don't think so.

For compound pages most of info of its state is stored in head page (e.g.
page_count(), flags, etc). So if we examine random page (pfn scanner case)
the very first thing we want to know if we stepped on tail page.
PageTail() is what I wanted to encode in the bit...

What if we change order of fields within rcu_head and put ->func first?
Can we expect this pointer to have bit 0 always clear?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-26 15:04                   ` Kirill A. Shutemov
@ 2015-08-26 15:39                     ` Vlastimil Babka
  -1 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-26 15:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, Paul E. McKenney
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On 08/26/2015 05:04 PM, Kirill A. Shutemov wrote:
>>>                                                                          That's
>>> bad news then. It's not that we would trigger that bit when the rcu_head part of
>>> the union is "active". It's that pfn scanners could inspect such page at
>>> arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
>>> tail page of a compound page, and interpret the rest of the pointer as a pointer
>>> to the head page (to test it for flags etc).
>>
>> On the other hand, if you avoid scanning rcu_head structures for pages
>> that are currently waiting for a grace period, no problem.  RCU does
>> not use the rcu_head structure at all except for during the time between
>> when call_rcu() is invoked on that rcu_head structure and the time that
>> the callback is invoked.
>>
>> Is there some other page state that indicates that the page is waiting
>> for a grace period?  If so, you could simply avoid testing that bit in
>> that case.
>
> No, I don't think so.
>
> For compound pages most of info of its state is stored in head page (e.g.
> page_count(), flags, etc). So if we examine random page (pfn scanner case)
> the very first thing we want to know if we stepped on tail page.
> PageTail() is what I wanted to encode in the bit...
>
> What if we change order of fields within rcu_head and put ->func first?

Or change the order of compound_head wrt the rest?

> Can we expect this pointer to have bit 0 always clear?

That's probably a question whether $compiler is guaranteed to align 
functions on all architectures...

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 15:39                     ` Vlastimil Babka
  0 siblings, 0 replies; 96+ messages in thread
From: Vlastimil Babka @ 2015-08-26 15:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, Paul E. McKenney
  Cc: Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On 08/26/2015 05:04 PM, Kirill A. Shutemov wrote:
>>>                                                                          That's
>>> bad news then. It's not that we would trigger that bit when the rcu_head part of
>>> the union is "active". It's that pfn scanners could inspect such page at
>>> arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
>>> tail page of a compound page, and interpret the rest of the pointer as a pointer
>>> to the head page (to test it for flags etc).
>>
>> On the other hand, if you avoid scanning rcu_head structures for pages
>> that are currently waiting for a grace period, no problem.  RCU does
>> not use the rcu_head structure at all except for during the time between
>> when call_rcu() is invoked on that rcu_head structure and the time that
>> the callback is invoked.
>>
>> Is there some other page state that indicates that the page is waiting
>> for a grace period?  If so, you could simply avoid testing that bit in
>> that case.
>
> No, I don't think so.
>
> For compound pages most of info of its state is stored in head page (e.g.
> page_count(), flags, etc). So if we examine random page (pfn scanner case)
> the very first thing we want to know if we stepped on tail page.
> PageTail() is what I wanted to encode in the bit...
>
> What if we change order of fields within rcu_head and put ->func first?

Or change the order of compound_head wrt the rest?

> Can we expect this pointer to have bit 0 always clear?

That's probably a question whether $compiler is guaranteed to align 
functions on all architectures...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-26 15:04                   ` Kirill A. Shutemov
@ 2015-08-26 16:38                     ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-26 16:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On Wed, Aug 26, 2015 at 06:04:12PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 25, 2015 at 02:19:54PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > >>>>>
> > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > >>>>>> space with:
> > > >>>>>>
> > > >>>>>>  - page->lru.next;
> > > >>>>>>  - page->next;
> > > >>>>>>  - page->rcu_head.next;
> > > >>>>>>  - page->pmd_huge_pte;
> > > >>>>>>
> > > >>>
> > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > >>> to use the bit too one day?
> > > >>
> > > >> +Paul.
> > > > 
> > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > bit before you invoke call_rcu(), no problem.
> > > 
> > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > 
> > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > structure because of the possibility to use that bit to flag callbacks
> > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > promote energy efficiency), then that bit might well be set during
> > RCU grace-period processing.
> 
> Ugh.. :-/
> 
> > > bad news then. It's not that we would trigger that bit when the rcu_head part of
> > > the union is "active". It's that pfn scanners could inspect such page at
> > > arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
> > > tail page of a compound page, and interpret the rest of the pointer as a pointer
> > > to the head page (to test it for flags etc).
> > 
> > On the other hand, if you avoid scanning rcu_head structures for pages
> > that are currently waiting for a grace period, no problem.  RCU does
> > not use the rcu_head structure at all except for during the time between
> > when call_rcu() is invoked on that rcu_head structure and the time that
> > the callback is invoked.
> > 
> > Is there some other page state that indicates that the page is waiting
> > for a grace period?  If so, you could simply avoid testing that bit in
> > that case.
> 
> No, I don't think so.

OK, I'll bite...  How do you know that it is safe to invoke call_rcu(),
given that you are not allowed to invoke call_rcu() until the previous
callback has been invoked?

> For compound pages most of info of its state is stored in head page (e.g.
> page_count(), flags, etc). So if we examine random page (pfn scanner case)
> the very first thing we want to know if we stepped on tail page.
> PageTail() is what I wanted to encode in the bit...

Ah, so that would require the page scanner to do reverse mapping or some
such, then.  Which is perhaps what you are trying to avoid.

> What if we change order of fields within rcu_head and put ->func first?
> Can we expect this pointer to have bit 0 always clear?

I asked that question some time back, and the answer was "no".  You
can apparently have functions that start at odd addresses on some
architectures.

That said, there are likely to be reserved bits somewhere in the function
address, perhaps varying depending on architecture and/or boot, in the
case of address-space randomization.  Perhaps some way of identifying
those bits with architecture-independent ways of querying and setting
them?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 16:38                     ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-26 16:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Andrew Morton, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, linux-kernel, linux-mm, Christoph Lameter

On Wed, Aug 26, 2015 at 06:04:12PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 25, 2015 at 02:19:54PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > >>>>>
> > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > >>>>>> space with:
> > > >>>>>>
> > > >>>>>>  - page->lru.next;
> > > >>>>>>  - page->next;
> > > >>>>>>  - page->rcu_head.next;
> > > >>>>>>  - page->pmd_huge_pte;
> > > >>>>>>
> > > >>>
> > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > >>> to use the bit too one day?
> > > >>
> > > >> +Paul.
> > > > 
> > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > bit before you invoke call_rcu(), no problem.
> > > 
> > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > 
> > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > structure because of the possibility to use that bit to flag callbacks
> > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > promote energy efficiency), then that bit might well be set during
> > RCU grace-period processing.
> 
> Ugh.. :-/
> 
> > > bad news then. It's not that we would trigger that bit when the rcu_head part of
> > > the union is "active". It's that pfn scanners could inspect such page at
> > > arbitrary time, see the bit 0 set (due to RCU processing) and think that it's a
> > > tail page of a compound page, and interpret the rest of the pointer as a pointer
> > > to the head page (to test it for flags etc).
> > 
> > On the other hand, if you avoid scanning rcu_head structures for pages
> > that are currently waiting for a grace period, no problem.  RCU does
> > not use the rcu_head structure at all except for during the time between
> > when call_rcu() is invoked on that rcu_head structure and the time that
> > the callback is invoked.
> > 
> > Is there some other page state that indicates that the page is waiting
> > for a grace period?  If so, you could simply avoid testing that bit in
> > that case.
> 
> No, I don't think so.

OK, I'll bite...  How do you know that it is safe to invoke call_rcu(),
given that you are not allowed to invoke call_rcu() until the previous
callback has been invoked?

> For compound pages most of info of its state is stored in head page (e.g.
> page_count(), flags, etc). So if we examine random page (pfn scanner case)
> the very first thing we want to know if we stepped on tail page.
> PageTail() is what I wanted to encode in the bit...

Ah, so that would require the page scanner to do reverse mapping or some
such, then.  Which is perhaps what you are trying to avoid.

> What if we change order of fields within rcu_head and put ->func first?
> Can we expect this pointer to have bit 0 always clear?

I asked that question some time back, and the answer was "no".  You
can apparently have functions that start at odd addresses on some
architectures.

That said, there are likely to be reserved bits somewhere in the function
address, perhaps varying depending on architecture and/or boot, in the
case of address-space randomization.  Perhaps some way of identifying
those bits with architecture-independent ways of querying and setting
them?

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-25 21:19                 ` Paul E. McKenney
@ 2015-08-26 18:18                   ` Hugh Dickins
  -1 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-26 18:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >>>>>
> > >>>>>> The patch introduces page->compound_head into third double word block in
> > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > >>>>>> space with:
> > >>>>>>
> > >>>>>>  - page->lru.next;
> > >>>>>>  - page->next;
> > >>>>>>  - page->rcu_head.next;
> > >>>>>>  - page->pmd_huge_pte;
> > >>>>>>
> > >>>
> > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > >>> to use the bit too one day?
> > >>
> > >> +Paul.
> > > 
> > > The call_rcu() function does stomp that bit, but if you stop using that
> > > bit before you invoke call_rcu(), no problem.
> > 
> > You mean that it sets the bit 0 of rcu_head.next during its processing?
> 
> Not at the moment, though RCU will splat if given a misaligned rcu_head
> structure because of the possibility to use that bit to flag callbacks
> that do nothing but free memory.  If RCU needs to do that (e.g., to
> promote energy efficiency), then that bit might well be set during
> RCU grace-period processing.

But if you do one day implement that, wouldn't sl?b.c have to use
call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
of getting that bit set?  (No rcu_head is placed in a PageTail page.)

So although it might be a little strange not to use a variant intended
for freeing memory when indeed that's what it's doing, it would not be
the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
call_rcu(), in defence of the struct page safety Kirill is proposing.

hUgh

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 18:18                   ` Hugh Dickins
  0 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-26 18:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > >>>>>
> > >>>>>> The patch introduces page->compound_head into third double word block in
> > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > >>>>>> space with:
> > >>>>>>
> > >>>>>>  - page->lru.next;
> > >>>>>>  - page->next;
> > >>>>>>  - page->rcu_head.next;
> > >>>>>>  - page->pmd_huge_pte;
> > >>>>>>
> > >>>
> > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > >>> to use the bit too one day?
> > >>
> > >> +Paul.
> > > 
> > > The call_rcu() function does stomp that bit, but if you stop using that
> > > bit before you invoke call_rcu(), no problem.
> > 
> > You mean that it sets the bit 0 of rcu_head.next during its processing?
> 
> Not at the moment, though RCU will splat if given a misaligned rcu_head
> structure because of the possibility to use that bit to flag callbacks
> that do nothing but free memory.  If RCU needs to do that (e.g., to
> promote energy efficiency), then that bit might well be set during
> RCU grace-period processing.

But if you do one day implement that, wouldn't sl?b.c have to use
call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
of getting that bit set?  (No rcu_head is placed in a PageTail page.)

So although it might be a little strange not to use a variant intended
for freeing memory when indeed that's what it's doing, it would not be
the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
call_rcu(), in defence of the struct page safety Kirill is proposing.

hUgh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-26 18:18                   ` Hugh Dickins
@ 2015-08-26 21:29                     ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-26 21:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > >>>>>
> > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > >>>>>> space with:
> > > >>>>>>
> > > >>>>>>  - page->lru.next;
> > > >>>>>>  - page->next;
> > > >>>>>>  - page->rcu_head.next;
> > > >>>>>>  - page->pmd_huge_pte;
> > > >>>>>>
> > > >>>
> > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > >>> to use the bit too one day?
> > > >>
> > > >> +Paul.
> > > > 
> > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > bit before you invoke call_rcu(), no problem.
> > > 
> > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > 
> > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > structure because of the possibility to use that bit to flag callbacks
> > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > promote energy efficiency), then that bit might well be set during
> > RCU grace-period processing.
> 
> But if you do one day implement that, wouldn't sl?b.c have to use
> call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> of getting that bit set?  (No rcu_head is placed in a PageTail page.)

Good point, call_rcu_lazy(), but yes.

> So although it might be a little strange not to use a variant intended
> for freeing memory when indeed that's what it's doing, it would not be
> the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> call_rcu(), in defence of the struct page safety Kirill is proposing.

As long as you are OK with the bottom bit being zero throughout the RCU
processing, yes.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 21:29                     ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-26 21:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > >>>>>
> > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > >>>>>> space with:
> > > >>>>>>
> > > >>>>>>  - page->lru.next;
> > > >>>>>>  - page->next;
> > > >>>>>>  - page->rcu_head.next;
> > > >>>>>>  - page->pmd_huge_pte;
> > > >>>>>>
> > > >>>
> > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > >>> to use the bit too one day?
> > > >>
> > > >> +Paul.
> > > > 
> > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > bit before you invoke call_rcu(), no problem.
> > > 
> > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > 
> > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > structure because of the possibility to use that bit to flag callbacks
> > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > promote energy efficiency), then that bit might well be set during
> > RCU grace-period processing.
> 
> But if you do one day implement that, wouldn't sl?b.c have to use
> call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> of getting that bit set?  (No rcu_head is placed in a PageTail page.)

Good point, call_rcu_lazy(), but yes.

> So although it might be a little strange not to use a variant intended
> for freeing memory when indeed that's what it's doing, it would not be
> the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> call_rcu(), in defence of the struct page safety Kirill is proposing.

As long as you are OK with the bottom bit being zero throughout the RCU
processing, yes.

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-26 21:29                     ` Paul E. McKenney
@ 2015-08-26 22:28                       ` Hugh Dickins
  -1 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-26 22:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Wed, 26 Aug 2015, Paul E. McKenney wrote:
> On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> > > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > > >>>>>
> > > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > > >>>>>> space with:
> > > > >>>>>>
> > > > >>>>>>  - page->lru.next;
> > > > >>>>>>  - page->next;
> > > > >>>>>>  - page->rcu_head.next;
> > > > >>>>>>  - page->pmd_huge_pte;
> > > > >>>>>>
> > > > >>>
> > > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > > >>> to use the bit too one day?
> > > > >>
> > > > >> +Paul.
> > > > > 
> > > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > > bit before you invoke call_rcu(), no problem.
> > > > 
> > > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > > 
> > > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > > structure because of the possibility to use that bit to flag callbacks
> > > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > > promote energy efficiency), then that bit might well be set during
> > > RCU grace-period processing.
> > 
> > But if you do one day implement that, wouldn't sl?b.c have to use
> > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> 
> Good point, call_rcu_lazy(), but yes.
> 
> > So although it might be a little strange not to use a variant intended
> > for freeing memory when indeed that's what it's doing, it would not be
> > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > call_rcu(), in defence of the struct page safety Kirill is proposing.
> 
> As long as you are OK with the bottom bit being zero throughout the RCU
> processing, yes.

That's exactly what we want: sounds like we have no problem, thanks Paul.

Hugh

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 22:28                       ` Hugh Dickins
  0 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-26 22:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Wed, 26 Aug 2015, Paul E. McKenney wrote:
> On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> > > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > > >>>>>
> > > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > > >>>>>> space with:
> > > > >>>>>>
> > > > >>>>>>  - page->lru.next;
> > > > >>>>>>  - page->next;
> > > > >>>>>>  - page->rcu_head.next;
> > > > >>>>>>  - page->pmd_huge_pte;
> > > > >>>>>>
> > > > >>>
> > > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > > >>> to use the bit too one day?
> > > > >>
> > > > >> +Paul.
> > > > > 
> > > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > > bit before you invoke call_rcu(), no problem.
> > > > 
> > > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > > 
> > > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > > structure because of the possibility to use that bit to flag callbacks
> > > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > > promote energy efficiency), then that bit might well be set during
> > > RCU grace-period processing.
> > 
> > But if you do one day implement that, wouldn't sl?b.c have to use
> > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> 
> Good point, call_rcu_lazy(), but yes.
> 
> > So although it might be a little strange not to use a variant intended
> > for freeing memory when indeed that's what it's doing, it would not be
> > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > call_rcu(), in defence of the struct page safety Kirill is proposing.
> 
> As long as you are OK with the bottom bit being zero throughout the RCU
> processing, yes.

That's exactly what we want: sounds like we have no problem, thanks Paul.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-26 22:28                       ` Hugh Dickins
@ 2015-08-26 23:34                         ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-26 23:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Wed, Aug 26, 2015 at 03:28:39PM -0700, Hugh Dickins wrote:
> On Wed, 26 Aug 2015, Paul E. McKenney wrote:
> > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > > On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> > > > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > > > >>>>>
> > > > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > > > >>>>>> space with:
> > > > > >>>>>>
> > > > > >>>>>>  - page->lru.next;
> > > > > >>>>>>  - page->next;
> > > > > >>>>>>  - page->rcu_head.next;
> > > > > >>>>>>  - page->pmd_huge_pte;
> > > > > >>>>>>
> > > > > >>>
> > > > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > > > >>> to use the bit too one day?
> > > > > >>
> > > > > >> +Paul.
> > > > > > 
> > > > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > > > bit before you invoke call_rcu(), no problem.
> > > > > 
> > > > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > > > 
> > > > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > > > structure because of the possibility to use that bit to flag callbacks
> > > > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > > > promote energy efficiency), then that bit might well be set during
> > > > RCU grace-period processing.
> > > 
> > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > 
> > Good point, call_rcu_lazy(), but yes.
> > 
> > > So although it might be a little strange not to use a variant intended
> > > for freeing memory when indeed that's what it's doing, it would not be
> > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > 
> > As long as you are OK with the bottom bit being zero throughout the RCU
> > processing, yes.
> 
> That's exactly what we want: sounds like we have no problem, thanks Paul.

Whew!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-26 23:34                         ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-26 23:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, Michal Hocko, David Rientjes, linux-kernel,
	linux-mm, Christoph Lameter

On Wed, Aug 26, 2015 at 03:28:39PM -0700, Hugh Dickins wrote:
> On Wed, 26 Aug 2015, Paul E. McKenney wrote:
> > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > > On Tue, 25 Aug 2015, Paul E. McKenney wrote:
> > > > On Tue, Aug 25, 2015 at 10:46:44PM +0200, Vlastimil Babka wrote:
> > > > > On 25.8.2015 22:11, Paul E. McKenney wrote:
> > > > > > On Tue, Aug 25, 2015 at 09:33:54PM +0300, Kirill A. Shutemov wrote:
> > > > > >> On Tue, Aug 25, 2015 at 01:44:13PM +0200, Vlastimil Babka wrote:
> > > > > >>> On 08/21/2015 02:10 PM, Kirill A. Shutemov wrote:
> > > > > >>>> On Thu, Aug 20, 2015 at 04:36:43PM -0700, Andrew Morton wrote:
> > > > > >>>>> On Wed, 19 Aug 2015 12:21:45 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > > > >>>>>
> > > > > >>>>>> The patch introduces page->compound_head into third double word block in
> > > > > >>>>>> front of compound_dtor and compound_order. That means it shares storage
> > > > > >>>>>> space with:
> > > > > >>>>>>
> > > > > >>>>>>  - page->lru.next;
> > > > > >>>>>>  - page->next;
> > > > > >>>>>>  - page->rcu_head.next;
> > > > > >>>>>>  - page->pmd_huge_pte;
> > > > > >>>>>>
> > > > > >>>
> > > > > >>> We should probably ask Paul about the chances that rcu_head.next would like
> > > > > >>> to use the bit too one day?
> > > > > >>
> > > > > >> +Paul.
> > > > > > 
> > > > > > The call_rcu() function does stomp that bit, but if you stop using that
> > > > > > bit before you invoke call_rcu(), no problem.
> > > > > 
> > > > > You mean that it sets the bit 0 of rcu_head.next during its processing?
> > > > 
> > > > Not at the moment, though RCU will splat if given a misaligned rcu_head
> > > > structure because of the possibility to use that bit to flag callbacks
> > > > that do nothing but free memory.  If RCU needs to do that (e.g., to
> > > > promote energy efficiency), then that bit might well be set during
> > > > RCU grace-period processing.
> > > 
> > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > 
> > Good point, call_rcu_lazy(), but yes.
> > 
> > > So although it might be a little strange not to use a variant intended
> > > for freeing memory when indeed that's what it's doing, it would not be
> > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > 
> > As long as you are OK with the bottom bit being zero throughout the RCU
> > processing, yes.
> 
> That's exactly what we want: sounds like we have no problem, thanks Paul.

Whew!  ;-)

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-26 21:29                     ` Paul E. McKenney
@ 2015-08-27 15:09                       ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 15:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
[...]
> > But if you do one day implement that, wouldn't sl?b.c have to use
> > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> 
> Good point, call_rcu_lazy(), but yes.
> 
> > So although it might be a little strange not to use a variant intended
> > for freeing memory when indeed that's what it's doing, it would not be
> > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > call_rcu(), in defence of the struct page safety Kirill is proposing.
> 
> As long as you are OK with the bottom bit being zero throughout the RCU
> processing, yes.

I am really not sure I udnerstand. What will prevent
call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?

Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
__call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
take to fix this would be to require struct rcu_head to be aligned to
32b no?

Btw. Do we need the same think for page::mapping and KSM?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 15:09                       ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 15:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
[...]
> > But if you do one day implement that, wouldn't sl?b.c have to use
> > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> 
> Good point, call_rcu_lazy(), but yes.
> 
> > So although it might be a little strange not to use a variant intended
> > for freeing memory when indeed that's what it's doing, it would not be
> > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > call_rcu(), in defence of the struct page safety Kirill is proposing.
> 
> As long as you are OK with the bottom bit being zero throughout the RCU
> processing, yes.

I am really not sure I udnerstand. What will prevent
call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?

Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
__call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
take to fix this would be to require struct rcu_head to be aligned to
32b no?

Btw. Do we need the same think for page::mapping and KSM?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-27 15:09                       ` Michal Hocko
@ 2015-08-27 16:03                         ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 16:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu 27-08-15 17:09:17, Michal Hocko wrote:
[...]
> Btw. Do we need the same think for page::mapping and KSM?

I guess we are safe here because the address for mappings comes from
kmalloc and that aligned properly, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 16:03                         ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 16:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu 27-08-15 17:09:17, Michal Hocko wrote:
[...]
> Btw. Do we need the same think for page::mapping and KSM?

I guess we are safe here because the address for mappings comes from
kmalloc and that aligned properly, right?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-27 15:09                       ` Michal Hocko
@ 2015-08-27 16:36                         ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-27 16:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu, Aug 27, 2015 at 05:09:17PM +0200, Michal Hocko wrote:
> On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> [...]
> > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > 
> > Good point, call_rcu_lazy(), but yes.
> > 
> > > So although it might be a little strange not to use a variant intended
> > > for freeing memory when indeed that's what it's doing, it would not be
> > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > 
> > As long as you are OK with the bottom bit being zero throughout the RCU
> > processing, yes.
> 
> I am really not sure I udnerstand. What will prevent
> call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?

As long as it uses call_rcu(), call_rcu_bh(), call_rcu_sched(),
or call_srcu() and not some future call_rcu_lazy(), no problem.

But yes, if you are going to assume that RCU leaves the bottom
bit of the rcu_head structure's ->next field zero, then everything
everywhere in the kernel might in the future need to be careful of
exactly what variant of call_rcu() is used.

> Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
> __call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
> take to fix this would be to require struct rcu_head to be aligned to
> 32b no?

There are some architectures that guarantee only 16-bit alignment.
If those architectures are fixed to do 32-bit alignment, or if support
for them is dropped, then the future restrictions mentioned above could
be dropped.

							Thanx, Paul

> Btw. Do we need the same think for page::mapping and KSM?
> -- 
> Michal Hocko
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 16:36                         ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-27 16:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu, Aug 27, 2015 at 05:09:17PM +0200, Michal Hocko wrote:
> On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> [...]
> > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > 
> > Good point, call_rcu_lazy(), but yes.
> > 
> > > So although it might be a little strange not to use a variant intended
> > > for freeing memory when indeed that's what it's doing, it would not be
> > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > 
> > As long as you are OK with the bottom bit being zero throughout the RCU
> > processing, yes.
> 
> I am really not sure I udnerstand. What will prevent
> call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?

As long as it uses call_rcu(), call_rcu_bh(), call_rcu_sched(),
or call_srcu() and not some future call_rcu_lazy(), no problem.

But yes, if you are going to assume that RCU leaves the bottom
bit of the rcu_head structure's ->next field zero, then everything
everywhere in the kernel might in the future need to be careful of
exactly what variant of call_rcu() is used.

> Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
> __call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
> take to fix this would be to require struct rcu_head to be aligned to
> 32b no?

There are some architectures that guarantee only 16-bit alignment.
If those architectures are fixed to do 32-bit alignment, or if support
for them is dropped, then the future restrictions mentioned above could
be dropped.

							Thanx, Paul

> Btw. Do we need the same think for page::mapping and KSM?
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-27 16:03                         ` Michal Hocko
@ 2015-08-27 17:28                           ` Hugh Dickins
  -1 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-27 17:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul E. McKenney, Hugh Dickins, Vlastimil Babka,
	Kirill A. Shutemov, Andrew Morton, Kirill A. Shutemov,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, David Rientjes,
	linux-kernel, linux-mm, Christoph Lameter

On Thu, 27 Aug 2015, Michal Hocko wrote:
> On Thu 27-08-15 17:09:17, Michal Hocko wrote:
> [...]
> > Btw. Do we need the same think for page::mapping and KSM?
> 
> I guess we are safe here because the address for mappings comes from
> kmalloc and that aligned properly, right?

Not quite right, in fact.  Because usually the struct address_space
is embedded within the struct inode (at i_data), and the struct inode
embedded within the fs-dependent inode, and that's what's kmalloc'ed.

What makes the mapping pointer low bits safe is include/linux/fs.h:
struct address_space {
	...
} __attribute__((aligned(sizeof(long))));

Which we first had to add in for the cris architecture, which stumbled
not on a genuine allocated address_space, but on that funny statically
declared swapper_space in mm/swap_state.c.

But struct anon_vma and KSM's struct stable_node (which depend on
the same scheme for low bits of page->mapping) have no such alignment
attribute specified: those ones are indeed relying on the kmalloc
guarantee as you suppose.

Does struct rcu_head have no __attribute__((aligned(whatever)))?
Perhaps that attribute should be added when it's needed.

Hugh

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 17:28                           ` Hugh Dickins
  0 siblings, 0 replies; 96+ messages in thread
From: Hugh Dickins @ 2015-08-27 17:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Paul E. McKenney, Hugh Dickins, Vlastimil Babka,
	Kirill A. Shutemov, Andrew Morton, Kirill A. Shutemov,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, David Rientjes,
	linux-kernel, linux-mm, Christoph Lameter

On Thu, 27 Aug 2015, Michal Hocko wrote:
> On Thu 27-08-15 17:09:17, Michal Hocko wrote:
> [...]
> > Btw. Do we need the same think for page::mapping and KSM?
> 
> I guess we are safe here because the address for mappings comes from
> kmalloc and that aligned properly, right?

Not quite right, in fact.  Because usually the struct address_space
is embedded within the struct inode (at i_data), and the struct inode
embedded within the fs-dependent inode, and that's what's kmalloc'ed.

What makes the mapping pointer low bits safe is include/linux/fs.h:
struct address_space {
	...
} __attribute__((aligned(sizeof(long))));

Which we first had to add in for the cris architecture, which stumbled
not on a genuine allocated address_space, but on that funny statically
declared swapper_space in mm/swap_state.c.

But struct anon_vma and KSM's struct stable_node (which depend on
the same scheme for low bits of page->mapping) have no such alignment
attribute specified: those ones are indeed relying on the kmalloc
guarantee as you suppose.

Does struct rcu_head have no __attribute__((aligned(whatever)))?
Perhaps that attribute should be added when it's needed.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-27 17:28                           ` Hugh Dickins
@ 2015-08-27 18:06                             ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 18:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Paul E. McKenney, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu 27-08-15 10:28:48, Hugh Dickins wrote:
> On Thu, 27 Aug 2015, Michal Hocko wrote:
> > On Thu 27-08-15 17:09:17, Michal Hocko wrote:
> > [...]
> > > Btw. Do we need the same think for page::mapping and KSM?
> > 
> > I guess we are safe here because the address for mappings comes from
> > kmalloc and that aligned properly, right?
> 
> Not quite right, in fact.  Because usually the struct address_space
> is embedded within the struct inode (at i_data), and the struct inode
> embedded within the fs-dependent inode, and that's what's kmalloc'ed.
> 
> What makes the mapping pointer low bits safe is include/linux/fs.h:
> struct address_space {
> 	...
> } __attribute__((aligned(sizeof(long))));

Oh, right you are.

> Which we first had to add in for the cris architecture, which stumbled
> not on a genuine allocated address_space, but on that funny statically
> declared swapper_space in mm/swap_state.c.

Thanks for the clarification.

> But struct anon_vma and KSM's struct stable_node (which depend on
> the same scheme for low bits of page->mapping) have no such alignment
> attribute specified: those ones are indeed relying on the kmalloc
> guarantee as you suppose.
> 
> Does struct rcu_head have no __attribute__((aligned(whatever)))?
> Perhaps that attribute should be added when it's needed.

That's basically what I meant in the previous email.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 18:06                             ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 18:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Paul E. McKenney, Vlastimil Babka, Kirill A. Shutemov,
	Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu 27-08-15 10:28:48, Hugh Dickins wrote:
> On Thu, 27 Aug 2015, Michal Hocko wrote:
> > On Thu 27-08-15 17:09:17, Michal Hocko wrote:
> > [...]
> > > Btw. Do we need the same think for page::mapping and KSM?
> > 
> > I guess we are safe here because the address for mappings comes from
> > kmalloc and that aligned properly, right?
> 
> Not quite right, in fact.  Because usually the struct address_space
> is embedded within the struct inode (at i_data), and the struct inode
> embedded within the fs-dependent inode, and that's what's kmalloc'ed.
> 
> What makes the mapping pointer low bits safe is include/linux/fs.h:
> struct address_space {
> 	...
> } __attribute__((aligned(sizeof(long))));

Oh, right you are.

> Which we first had to add in for the cris architecture, which stumbled
> not on a genuine allocated address_space, but on that funny statically
> declared swapper_space in mm/swap_state.c.

Thanks for the clarification.

> But struct anon_vma and KSM's struct stable_node (which depend on
> the same scheme for low bits of page->mapping) have no such alignment
> attribute specified: those ones are indeed relying on the kmalloc
> guarantee as you suppose.
> 
> Does struct rcu_head have no __attribute__((aligned(whatever)))?
> Perhaps that attribute should be added when it's needed.

That's basically what I meant in the previous email.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-27 16:36                         ` Paul E. McKenney
@ 2015-08-27 18:14                           ` Michal Hocko
  -1 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 18:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu 27-08-15 09:36:34, Paul E. McKenney wrote:
> On Thu, Aug 27, 2015 at 05:09:17PM +0200, Michal Hocko wrote:
> > On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> > > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > [...]
> > > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > > 
> > > Good point, call_rcu_lazy(), but yes.
> > > 
> > > > So although it might be a little strange not to use a variant intended
> > > > for freeing memory when indeed that's what it's doing, it would not be
> > > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > > 
> > > As long as you are OK with the bottom bit being zero throughout the RCU
> > > processing, yes.
> > 
> > I am really not sure I udnerstand. What will prevent
> > call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?
> 
> As long as it uses call_rcu(), call_rcu_bh(), call_rcu_sched(),
> or call_srcu() and not some future call_rcu_lazy(), no problem.
> 
> But yes, if you are going to assume that RCU leaves the bottom
> bit of the rcu_head structure's ->next field zero, then everything
> everywhere in the kernel might in the future need to be careful of
> exactly what variant of call_rcu() is used.

OK, so it would be call_rcu_$special to use the bit. This wasn't entirely
clear to me. I thought it would be opposite.

> > Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
> > __call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
> > take to fix this would be to require struct rcu_head to be aligned to
> > 32b no?
> 
> There are some architectures that guarantee only 16-bit alignment.
> If those architectures are fixed to do 32-bit alignment, or if support
> for them is dropped, then the future restrictions mentioned above could
> be dropped.

My understanding of the discussion which led to the above patch is that
m68k allows for 32b alignment you just have to be explicit about that
(http://thread.gmane.org/gmane.linux.ports.m68k/5932/focus=5960). Which
other archs would be affected?

I mean, this patch allows for quite some simplification in the mm code.
And I think that RCU can live with mm of the low bits without any
issues. You've said that one bit should be sufficient for the RCU use
case. So having 2 bits sounds like a good thing.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 18:14                           ` Michal Hocko
  0 siblings, 0 replies; 96+ messages in thread
From: Michal Hocko @ 2015-08-27 18:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu 27-08-15 09:36:34, Paul E. McKenney wrote:
> On Thu, Aug 27, 2015 at 05:09:17PM +0200, Michal Hocko wrote:
> > On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> > > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > [...]
> > > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > > 
> > > Good point, call_rcu_lazy(), but yes.
> > > 
> > > > So although it might be a little strange not to use a variant intended
> > > > for freeing memory when indeed that's what it's doing, it would not be
> > > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > > 
> > > As long as you are OK with the bottom bit being zero throughout the RCU
> > > processing, yes.
> > 
> > I am really not sure I udnerstand. What will prevent
> > call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?
> 
> As long as it uses call_rcu(), call_rcu_bh(), call_rcu_sched(),
> or call_srcu() and not some future call_rcu_lazy(), no problem.
> 
> But yes, if you are going to assume that RCU leaves the bottom
> bit of the rcu_head structure's ->next field zero, then everything
> everywhere in the kernel might in the future need to be careful of
> exactly what variant of call_rcu() is used.

OK, so it would be call_rcu_$special to use the bit. This wasn't entirely
clear to me. I thought it would be opposite.

> > Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
> > __call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
> > take to fix this would be to require struct rcu_head to be aligned to
> > 32b no?
> 
> There are some architectures that guarantee only 16-bit alignment.
> If those architectures are fixed to do 32-bit alignment, or if support
> for them is dropped, then the future restrictions mentioned above could
> be dropped.

My understanding of the discussion which led to the above patch is that
m68k allows for 32b alignment you just have to be explicit about that
(http://thread.gmane.org/gmane.linux.ports.m68k/5932/focus=5960). Which
other archs would be affected?

I mean, this patch allows for quite some simplification in the mm code.
And I think that RCU can live with mm of the low bits without any
issues. You've said that one bit should be sufficient for the RCU use
case. So having 2 bits sounds like a good thing.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
  2015-08-27 18:14                           ` Michal Hocko
@ 2015-08-27 19:01                             ` Paul E. McKenney
  -1 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-27 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu, Aug 27, 2015 at 08:14:35PM +0200, Michal Hocko wrote:
> On Thu 27-08-15 09:36:34, Paul E. McKenney wrote:
> > On Thu, Aug 27, 2015 at 05:09:17PM +0200, Michal Hocko wrote:
> > > On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> > > > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > > [...]
> > > > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > > > 
> > > > Good point, call_rcu_lazy(), but yes.
> > > > 
> > > > > So although it might be a little strange not to use a variant intended
> > > > > for freeing memory when indeed that's what it's doing, it would not be
> > > > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > > > 
> > > > As long as you are OK with the bottom bit being zero throughout the RCU
> > > > processing, yes.
> > > 
> > > I am really not sure I udnerstand. What will prevent
> > > call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?
> > 
> > As long as it uses call_rcu(), call_rcu_bh(), call_rcu_sched(),
> > or call_srcu() and not some future call_rcu_lazy(), no problem.
> > 
> > But yes, if you are going to assume that RCU leaves the bottom
> > bit of the rcu_head structure's ->next field zero, then everything
> > everywhere in the kernel might in the future need to be careful of
> > exactly what variant of call_rcu() is used.
> 
> OK, so it would be call_rcu_$special to use the bit. This wasn't entirely
> clear to me. I thought it would be opposite.

Yes.  And I cannot resist adding that the need to avoid
call_rcu_$special() would be with respect to a given rcu_head structure,
not global.  Though I believe that you already figured that out.  ;-)

> > > Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
> > > __call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
> > > take to fix this would be to require struct rcu_head to be aligned to
> > > 32b no?
> > 
> > There are some architectures that guarantee only 16-bit alignment.
> > If those architectures are fixed to do 32-bit alignment, or if support
> > for them is dropped, then the future restrictions mentioned above could
> > be dropped.
> 
> My understanding of the discussion which led to the above patch is that
> m68k allows for 32b alignment you just have to be explicit about that
> (http://thread.gmane.org/gmane.linux.ports.m68k/5932/focus=5960). Which
> other archs would be affected?
> 
> I mean, this patch allows for quite some simplification in the mm code.
> And I think that RCU can live with mm of the low bits without any
> issues. You've said that one bit should be sufficient for the RCU use
> case. So having 2 bits sounds like a good thing.

As long as MM doesn't use call_rcu_$special() for the rcu_head structure
in question, as long as MM is OK with the bottom bit of ->next always
being zero during a grace period, and as long as MM avoids writing
to ->next during a grace period, we should be good as is, even if a
call_rcu_$special() becomes necessary.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCHv3 4/5] mm: make compound_head() robust
@ 2015-08-27 19:01                             ` Paul E. McKenney
  0 siblings, 0 replies; 96+ messages in thread
From: Paul E. McKenney @ 2015-08-27 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Kirill A. Shutemov, Andrea Arcangeli, Dave Hansen,
	Johannes Weiner, David Rientjes, linux-kernel, linux-mm,
	Christoph Lameter

On Thu, Aug 27, 2015 at 08:14:35PM +0200, Michal Hocko wrote:
> On Thu 27-08-15 09:36:34, Paul E. McKenney wrote:
> > On Thu, Aug 27, 2015 at 05:09:17PM +0200, Michal Hocko wrote:
> > > On Wed 26-08-15 14:29:16, Paul E. McKenney wrote:
> > > > On Wed, Aug 26, 2015 at 11:18:45AM -0700, Hugh Dickins wrote:
> > > [...]
> > > > > But if you do one day implement that, wouldn't sl?b.c have to use
> > > > > call_rcu_with_added_meaning() instead of call_rcu(), to be in danger
> > > > > of getting that bit set?  (No rcu_head is placed in a PageTail page.)
> > > > 
> > > > Good point, call_rcu_lazy(), but yes.
> > > > 
> > > > > So although it might be a little strange not to use a variant intended
> > > > > for freeing memory when indeed that's what it's doing, it would not be
> > > > > the end of the world for SLAB_DESTROY_BY_RCU to carry on using straight
> > > > > call_rcu(), in defence of the struct page safety Kirill is proposing.
> > > > 
> > > > As long as you are OK with the bottom bit being zero throughout the RCU
> > > > processing, yes.
> > > 
> > > I am really not sure I udnerstand. What will prevent
> > > call_rcu(&page->rcu_head, free_page_rcu) done in a random driver?
> > 
> > As long as it uses call_rcu(), call_rcu_bh(), call_rcu_sched(),
> > or call_srcu() and not some future call_rcu_lazy(), no problem.
> > 
> > But yes, if you are going to assume that RCU leaves the bottom
> > bit of the rcu_head structure's ->next field zero, then everything
> > everywhere in the kernel might in the future need to be careful of
> > exactly what variant of call_rcu() is used.
> 
> OK, so it would be call_rcu_$special to use the bit. This wasn't entirely
> clear to me. I thought it would be opposite.

Yes.  And I cannot resist adding that the need to avoid
call_rcu_$special() would be with respect to a given rcu_head structure,
not global.  Though I believe that you already figured that out.  ;-)

> > > Cannot the RCU simply claim bit1? I can see 1146edcbef37 ("rcu: Loosen
> > > __call_rcu()'s rcu_head alignment constraint") but AFAIU all it would
> > > take to fix this would be to require struct rcu_head to be aligned to
> > > 32b no?
> > 
> > There are some architectures that guarantee only 16-bit alignment.
> > If those architectures are fixed to do 32-bit alignment, or if support
> > for them is dropped, then the future restrictions mentioned above could
> > be dropped.
> 
> My understanding of the discussion which led to the above patch is that
> m68k allows for 32b alignment you just have to be explicit about that
> (http://thread.gmane.org/gmane.linux.ports.m68k/5932/focus=5960). Which
> other archs would be affected?
> 
> I mean, this patch allows for quite some simplification in the mm code.
> And I think that RCU can live with mm of the low bits without any
> issues. You've said that one bit should be sufficient for the RCU use
> case. So having 2 bits sounds like a good thing.

As long as MM doesn't use call_rcu_$special() for the rcu_head structure
in question, as long as MM is OK with the bottom bit of ->next always
being zero during a grace period, and as long as MM avoids writing
to ->next during a grace period, we should be good as is, even if a
call_rcu_$special() becomes necessary.

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2015-08-27 19:10 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-19  9:21 [PATCHv3 0/5] Fix compound_head() race Kirill A. Shutemov
2015-08-19  9:21 ` Kirill A. Shutemov
2015-08-19  9:21 ` [PATCHv3 1/5] mm: drop page->slab_page Kirill A. Shutemov
2015-08-19  9:21   ` Kirill A. Shutemov
2015-08-24 14:59   ` Vlastimil Babka
2015-08-24 14:59     ` Vlastimil Babka
2015-08-24 15:02   ` Vlastimil Babka
2015-08-24 15:02     ` Vlastimil Babka
2015-08-25 17:24     ` Kirill A. Shutemov
2015-08-25 17:24       ` Kirill A. Shutemov
2015-08-19  9:21 ` [PATCHv3 2/5] zsmalloc: use page->private instead of page->first_page Kirill A. Shutemov
2015-08-19  9:21   ` Kirill A. Shutemov
2015-08-24 15:04   ` Vlastimil Babka
2015-08-24 15:04     ` Vlastimil Babka
2015-08-19  9:21 ` [PATCHv3 3/5] mm: pack compound_dtor and compound_order into one word in struct page Kirill A. Shutemov
2015-08-19  9:21   ` Kirill A. Shutemov
2015-08-20 23:26   ` Andrew Morton
2015-08-20 23:26     ` Andrew Morton
2015-08-21  7:13     ` Michal Hocko
2015-08-21  7:13       ` Michal Hocko
2015-08-21 10:40       ` Kirill A. Shutemov
2015-08-21 10:40         ` Kirill A. Shutemov
2015-08-21 10:51         ` Michal Hocko
2015-08-21 10:51           ` Michal Hocko
2015-08-19  9:21 ` [PATCHv3 4/5] mm: make compound_head() robust Kirill A. Shutemov
2015-08-19  9:21   ` Kirill A. Shutemov
2015-08-20 23:36   ` Andrew Morton
2015-08-20 23:36     ` Andrew Morton
2015-08-21 12:10     ` Kirill A. Shutemov
2015-08-21 12:10       ` Kirill A. Shutemov
2015-08-21 16:11       ` Christoph Lameter
2015-08-21 16:11         ` Christoph Lameter
2015-08-21 19:31         ` Kirill A. Shutemov
2015-08-21 19:31           ` Kirill A. Shutemov
2015-08-21 19:34           ` Andrew Morton
2015-08-21 19:34             ` Andrew Morton
2015-08-21 21:15             ` Christoph Lameter
2015-08-21 21:15               ` Christoph Lameter
2015-08-24 15:49             ` Vlastimil Babka
2015-08-24 15:49               ` Vlastimil Babka
2015-08-25 11:44       ` Vlastimil Babka
2015-08-25 11:44         ` Vlastimil Babka
2015-08-25 18:33         ` Kirill A. Shutemov
2015-08-25 18:33           ` Kirill A. Shutemov
2015-08-25 20:11           ` Paul E. McKenney
2015-08-25 20:11             ` Paul E. McKenney
2015-08-25 20:46             ` Vlastimil Babka
2015-08-25 20:46               ` Vlastimil Babka
2015-08-25 21:19               ` Paul E. McKenney
2015-08-25 21:19                 ` Paul E. McKenney
2015-08-26 15:04                 ` Kirill A. Shutemov
2015-08-26 15:04                   ` Kirill A. Shutemov
2015-08-26 15:39                   ` Vlastimil Babka
2015-08-26 15:39                     ` Vlastimil Babka
2015-08-26 16:38                   ` Paul E. McKenney
2015-08-26 16:38                     ` Paul E. McKenney
2015-08-26 18:18                 ` Hugh Dickins
2015-08-26 18:18                   ` Hugh Dickins
2015-08-26 21:29                   ` Paul E. McKenney
2015-08-26 21:29                     ` Paul E. McKenney
2015-08-26 22:28                     ` Hugh Dickins
2015-08-26 22:28                       ` Hugh Dickins
2015-08-26 23:34                       ` Paul E. McKenney
2015-08-26 23:34                         ` Paul E. McKenney
2015-08-27 15:09                     ` Michal Hocko
2015-08-27 15:09                       ` Michal Hocko
2015-08-27 16:03                       ` Michal Hocko
2015-08-27 16:03                         ` Michal Hocko
2015-08-27 17:28                         ` Hugh Dickins
2015-08-27 17:28                           ` Hugh Dickins
2015-08-27 18:06                           ` Michal Hocko
2015-08-27 18:06                             ` Michal Hocko
2015-08-27 16:36                       ` Paul E. McKenney
2015-08-27 16:36                         ` Paul E. McKenney
2015-08-27 18:14                         ` Michal Hocko
2015-08-27 18:14                           ` Michal Hocko
2015-08-27 19:01                           ` Paul E. McKenney
2015-08-27 19:01                             ` Paul E. McKenney
2015-08-23 23:59   ` Jesper Dangaard Brouer
2015-08-23 23:59     ` Jesper Dangaard Brouer
2015-08-24  9:29     ` Kirill A. Shutemov
2015-08-24  9:29       ` Kirill A. Shutemov
2015-08-24 10:17   ` Kirill A. Shutemov
2015-08-24 10:17     ` Kirill A. Shutemov
2015-08-19  9:21 ` [PATCHv3 5/5] mm: use 'unsigned int' for page order Kirill A. Shutemov
2015-08-19  9:21   ` Kirill A. Shutemov
2015-08-20  8:32   ` Michal Hocko
2015-08-20  8:32     ` Michal Hocko
2015-08-20 12:31 ` [PATCHv3 0/5] Fix compound_head() race Kirill A. Shutemov
2015-08-20 12:31   ` Kirill A. Shutemov
2015-08-20 23:38   ` Andrew Morton
2015-08-20 23:38     ` Andrew Morton
2015-08-22 20:13     ` Hugh Dickins
2015-08-22 20:13       ` Hugh Dickins
2015-08-24  9:36       ` Kirill A. Shutemov
2015-08-24  9:36         ` Kirill A. Shutemov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.