* [PATCHv5 0/7] Fix compound_head() race
@ 2015-09-03 12:35 ` Kirill A. Shutemov
  0 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov

Here's my attempt at fixing the race in compound_head(). It should make
compound_head() reliable in all contexts.

The last patch is optional.

It applies cleanly on top of the mmotm patch stack, just before my page-flags
patchset.

As expected, it causes a few conflicts with the following patches:

 page-flags-introduce-page-flags-policies-wrt-compound-pages.patch
 mm-sanitize-page-mapping-for-tail-pages.patch
 include-linux-page-flagsh-rename-macros-to-avoid-collisions.patch

Updated patches with the conflicts resolved can be found here:

 http://marc.info/?l=linux-kernel&m=144007388303804&q=p4
 http://marc.info/?l=linux-kernel&m=144007388303804&q=p5
 http://marc.info/?l=linux-kernel&m=144007388303804&q=p3

v5:
   - Fix collision with hugetlb_cgroup (see updated patch description in
     patch 5/7);

v4:
   - init page->lru on init_reserved_page() for
     DEFERRED_STRUCT_PAGE_INIT=n;
   - fix zsmalloc breakage (reported by Sergey Senozhatsky);
   - move #ifdef CONFIG_64BIT into separate patch;
   - enum compound_dtor_id;
   - move pmd_huge_pte to other word to avoid conflict with compound_head;
   - compile-time LIST_POISON1 sanity check;
   - few cleanups around page->rcu_head;

v3:
   - Fix build without hugetlb;
   - Drop page->first_page;
   - Update comment for free_compound_page();
   - Use 'unsigned int' for page order;

v2: Per Hugh's suggestion, page->compound_head is moved into the third double
    word. This way we avoid the memory overhead that v1 had in some cases.

    This place in struct page is rather overloaded. More testing is required
    to make sure we don't collide with any of its other users.

Kirill A. Shutemov (7):
  mm: drop page->slab_page
  slub: use page->rcu_head instead of page->lru plus cast
  zsmalloc: use page->private instead of page->first_page
  mm: pack compound_dtor and compound_order into one word in struct page
  mm: make compound_head() robust
  mm: use 'unsigned int' for page order
  mm: use 'unsigned int' for compound_dtor/compound_order on 64BIT

 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/hugetlb_cgroup.h         |  4 +-
 include/linux/mm.h                     | 82 ++++++++++-----------------------
 include/linux/mm_types.h               | 30 ++++++++----
 include/linux/page-flags.h             | 80 ++++++++------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 --
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           | 35 +++++++-------
 mm/hugetlb_cgroup.c                    |  2 +-
 mm/internal.h                          |  8 ++--
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 84 ++++++++++++++++++++++------------
 mm/slab.c                              | 17 ++-----
 mm/slub.c                              |  5 +-
 mm/swap.c                              |  4 +-
 mm/zsmalloc.c                          | 11 ++---
 18 files changed, 156 insertions(+), 238 deletions(-)

-- 
2.5.0


* [PATCHv5 1/7] mm: drop page->slab_page
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov, Joonsoo Kim, Andi Kleen

Since 8456a648cf44 ("slab: use struct page for slab management") nobody
uses the slab_page field in struct page.

Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <ak@linux.intel.com>
---
 include/linux/mm_types.h |  1 -
 mm/slab.c                | 17 +++--------------
 2 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..58620ac7f15c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -140,7 +140,6 @@ struct page {
 #endif
 		};
 
-		struct slab *slab_page; /* slab fields */
 		struct rcu_head rcu_head;	/* Used by SLAB
 						 * when destroying via RCU
 						 */
diff --git a/mm/slab.c b/mm/slab.c
index 200e22412a16..649044f26e5d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1888,21 +1888,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
 
 	freelist = page->freelist;
 	slab_destroy_debugcheck(cachep, page);
-	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
-		struct rcu_head *head;
-
-		/*
-		 * RCU free overloads the RCU head over the LRU.
-		 * slab_page has been overloeaded over the LRU,
-		 * however it is not used from now on so that
-		 * we can use it safely.
-		 */
-		head = (void *)&page->rcu_head;
-		call_rcu(head, kmem_rcu_free);
-
-	} else {
+	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
+		call_rcu(&page->rcu_head, kmem_rcu_free);
+	else
 		kmem_freepages(cachep, page);
-	}
 
 	/*
 	 * From now on, we don't use freelist
-- 
2.5.0


* [PATCHv5 2/7] slub: use page->rcu_head instead of page->lru plus cast
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov, Christoph Lameter

We have a properly typed page->rcu_head; there is no need to cast page->lru.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
---
 mm/slub.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 816df0016555..869642d03b22 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1510,10 +1510,7 @@ static void free_slab(struct kmem_cache *s, struct page *page)
 			VM_BUG_ON(s->reserved != sizeof(*head));
 			head = page_address(page) + offset;
 		} else {
-			/*
-			 * RCU free overloads the RCU head over the LRU
-			 */
-			head = (void *)&page->lru;
+			head = &page->rcu_head;
 		}
 
 		call_rcu(head, rcu_free_slab);
-- 
2.5.0


* [PATCHv5 3/7] zsmalloc: use page->private instead of page->first_page
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov

We are going to rework how compound_head() works. It will no longer use
page->first_page as it does now.

The only other user of page->first_page beyond compound pages is
zsmalloc.

Let's use page->private instead of page->first_page here. It occupies
the same storage space.
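
To illustrate why this is a like-for-like swap, here is a tiny userspace
sketch (fake_page and the helpers below are stand-ins, not the kernel API):
page->private is an unsigned long in the same union slot, so the first-page
pointer simply round-trips through a cast, as get_first_page() does after
this patch.

	#include <assert.h>

	struct fake_page {
		unsigned long private;	/* stand-in for page->private */
	};

	static void set_page_private(struct fake_page *page, unsigned long data)
	{
		page->private = data;
	}

	static unsigned long page_private(struct fake_page *page)
	{
		return page->private;
	}

	int main(void)
	{
		struct fake_page first_page, sub_page;

		/* link a sub-page back to its first page via ->private */
		set_page_private(&sub_page, (unsigned long)&first_page);
		assert((struct fake_page *)page_private(&sub_page) == &first_page);
		return 0;
	}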

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
---
 mm/zsmalloc.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0a7f81aa2249..6a092de6b636 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -16,7 +16,7 @@
  * struct page(s) to form a zspage.
  *
  * Usage of struct page fields:
- *	page->first_page: points to the first component (0-order) page
+ *	page->private: points to the first component (0-order) page
  *	page->index (union with page->freelist): offset of the first object
  *		starting in this page. For the first page, this is
  *		always 0, so we use this field (aka freelist) to point
@@ -26,8 +26,7 @@
  *
  *	For _first_ page only:
  *
- *	page->private (union with page->first_page): refers to the
- *		component page after the first page
+ *	page->private: refers to the component page after the first page
  *		If the page is first_page for huge object, it stores handle.
  *		Look at size_class->huge.
  *	page->freelist: points to the first free object in zspage.
@@ -770,7 +769,7 @@ static struct page *get_first_page(struct page *page)
 	if (is_first_page(page))
 		return page;
 	else
-		return page->first_page;
+		return (struct page *)page_private(page);
 }
 
 static struct page *get_next_page(struct page *page)
@@ -955,7 +954,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	 * Allocate individual pages and link them together as:
 	 * 1. first page->private = first sub-page
 	 * 2. all sub-pages are linked together using page->lru
-	 * 3. each sub-page is linked to the first page using page->first_page
+	 * 3. each sub-page is linked to the first page using page->private
 	 *
 	 * For each size class, First/Head pages are linked together using
 	 * page->lru. Also, we set PG_private to identify the first page
@@ -980,7 +979,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
 		if (i >= 1)
-			page->first_page = first_page;
+			set_page_private(page, (unsigned long)first_page);
 		if (i >= 2)
 			list_add(&page->lru, &prev_page->lru);
 		if (i == class->pages_per_zspage - 1)	/* last page */
-- 
2.5.0


* [PATCHv5 4/7] mm: pack compound_dtor and compound_order into one word in struct page
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov

The patch halves the space occupied by compound_dtor and compound_order in
struct page.

For compound_order, it's a trivial long -> short conversion.

For get_compound_page_dtor(), we now use a hardcoded table for destructor
lookup and store its index in struct page instead of a direct pointer to the
destructor. It shouldn't be much trouble to maintain the table: we only have
two destructors and NULL at the moment.

This patch frees up one word in tail pages for reuse. It is preparation for
the next patch.
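
To show the idea in isolation, here is a condensed userspace sketch (the
names mirror the diff below, but fake_page and main() are obviously
illustrative, not kernel code): the page stores a small index, and the
destructor is looked up in a fixed table instead of being kept as a pointer
in every compound page.

	#include <stdio.h>

	struct fake_page;
	typedef void compound_page_dtor(struct fake_page *);

	enum compound_dtor_id {
		NULL_COMPOUND_DTOR,
		COMPOUND_PAGE_DTOR,
		NR_COMPOUND_DTORS,
	};

	struct fake_page {
		unsigned short compound_dtor;	/* index into the table, not a pointer */
		unsigned short compound_order;
	};

	static void free_compound_page(struct fake_page *page)
	{
		printf("freeing order-%u compound page\n",
		       (unsigned int)page->compound_order);
	}

	static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
		[NULL_COMPOUND_DTOR]	= NULL,
		[COMPOUND_PAGE_DTOR]	= free_compound_page,
	};

	int main(void)
	{
		struct fake_page tail = {
			.compound_dtor	= COMPOUND_PAGE_DTOR,
			.compound_order	= 2,
		};

		/* the table lookup replaces the stored function pointer */
		compound_page_dtors[tail.compound_dtor](&tail);
		return 0;
	}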

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h       | 24 +++++++++++++++++++-----
 include/linux/mm_types.h |  6 ++----
 mm/hugetlb.c             |  8 ++++----
 mm/page_alloc.c          | 11 ++++++++++-
 4 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f92dbac..2cfe6051574c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -575,18 +575,32 @@ int split_free_page(struct page *page);
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
- * These are _only_ valid on the head of a PG_compound page.
+ * These are _only_ valid on the head of a compound page.
  */
+typedef void compound_page_dtor(struct page *);
+
+/* Keep the enum in sync with compound_page_dtors array in mm/page_alloc.c */
+enum compound_dtor_id {
+	NULL_COMPOUND_DTOR,
+	COMPOUND_PAGE_DTOR,
+#ifdef CONFIG_HUGETLB_PAGE
+	HUGETLB_PAGE_DTOR,
+#endif
+	NR_COMPOUND_DTORS,
+};
+extern compound_page_dtor * const compound_page_dtors[];
 
 static inline void set_compound_page_dtor(struct page *page,
-						compound_page_dtor *dtor)
+		enum compound_dtor_id compound_dtor)
 {
-	page[1].compound_dtor = dtor;
+	VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
+	page[1].compound_dtor = compound_dtor;
 }
 
 static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 {
-	return page[1].compound_dtor;
+	VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
+	return compound_page_dtors[page[1].compound_dtor];
 }
 
 static inline int compound_order(struct page *page)
@@ -596,7 +610,7 @@ static inline int compound_order(struct page *page)
 	return page[1].compound_order;
 }
 
-static inline void set_compound_order(struct page *page, unsigned long order)
+static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 58620ac7f15c..408a54563f65 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -28,8 +28,6 @@ struct mem_cgroup;
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
 #define ALLOC_SPLIT_PTLOCKS	(SPINLOCK_SIZE > BITS_PER_LONG/8)
 
-typedef void compound_page_dtor(struct page *);
-
 /*
  * Each physical page in the system has a struct page associated with
  * it to keep track of whatever it is we are using the page for at the
@@ -145,8 +143,8 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
-			compound_page_dtor *compound_dtor;
-			unsigned long compound_order;
+			unsigned short int compound_dtor;
+			unsigned short int compound_order;
 		};
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..8ea74caa1fa8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -969,7 +969,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_writeback);
 	}
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
-	set_compound_page_dtor(page, NULL);
+	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
@@ -1065,7 +1065,7 @@ void free_huge_page(struct page *page)
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
-	set_compound_page_dtor(page, free_huge_page);
+	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	spin_lock(&hugetlb_lock);
 	set_hugetlb_cgroup(page, NULL);
 	h->nr_huge_pages++;
@@ -1117,7 +1117,7 @@ int PageHuge(struct page *page)
 		return 0;
 
 	page = compound_head(page);
-	return get_compound_page_dtor(page) == free_huge_page;
+	return page[1].compound_dtor == HUGETLB_PAGE_DTOR;
 }
 EXPORT_SYMBOL_GPL(PageHuge);
 
@@ -1314,7 +1314,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 	if (page) {
 		INIT_LIST_HEAD(&page->lru);
 		r_nid = page_to_nid(page);
-		set_compound_page_dtor(page, free_huge_page);
+		set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 		set_hugetlb_cgroup(page, NULL);
 		/*
 		 * We incremented the global counters already
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df959b7d6085..c6733cc3cbce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -208,6 +208,15 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "Movable",
 };
 
+static void free_compound_page(struct page *page);
+compound_page_dtor * const compound_page_dtors[] = {
+	NULL,
+	free_compound_page,
+#ifdef CONFIG_HUGETLB_PAGE
+	free_huge_page,
+#endif
+};
+
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 
@@ -437,7 +446,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	int i;
 	int nr_pages = 1 << order;
 
-	set_compound_page_dtor(page, free_compound_page);
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
 	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
-- 
2.5.0


* [PATCHv5 5/7] mm: make compound_head() robust
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov, Paul E. McKenney

Hugh has pointed out that a compound_head() call can be unsafe in some
contexts. Here's one example:

	CPU0					CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
					put_page()
					  tail->first_page = NULL
      head = tail->first_page
					alloc_pages(__GFP_COMP)
					   prep_compound_page()
					     tail->first_page = head
					     __SetPageTail(p);
      !!PageTail() == true
    <head == NULL dereferencing>

The race is purely theoretical. I don't think it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how PageTail() and compound_head() are
encoded within struct page, so that both can be updated in one shot.

The patch introduces page->compound_head in the third double word block, in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and, if
bit zero is set, the rest of the bits are a pointer to the head page.

The patch moves page->pmd_huge_pte out of this word, just in case an
architecture defines pgtable_t as something that can have bit 0 set.

hugetlb_cgroup uses page->lru.next in the second tail page to store a
pointer to struct hugetlb_cgroup. The patch switches it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.

The patch also opens up the possibility of removing the
HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
tail page to store the struct hugetlb_cgroup pointer. But that's out of scope
for this patch.

That means page->compound_head shares storage space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;

That's too long a list to be absolutely sure, but it looks like nobody uses
bit 0 of the word.
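
For reference, here is a minimal userspace sketch of the encoding (fake_page
and the helpers are illustrative; the real helpers are in the diff below and
use READ_ONCE()/WRITE_ONCE() so that the tail flag and the head pointer are
read and written as a single word):

	#include <assert.h>

	struct fake_page {
		unsigned long compound_head;	/* bit 0 set => tail page */
	};

	static void set_compound_head(struct fake_page *page, struct fake_page *head)
	{
		/* struct page is word-aligned, so bit 0 of a real pointer is clear */
		page->compound_head = (unsigned long)head + 1;
	}

	static int PageTail(struct fake_page *page)
	{
		return page->compound_head & 1;
	}

	static struct fake_page *compound_head(struct fake_page *page)
	{
		unsigned long head = page->compound_head;

		if (head & 1)
			return (struct fake_page *)(head - 1);
		return page;
	}

	int main(void)
	{
		struct fake_page pages[2] = { { 0 }, { 0 } };

		set_compound_head(&pages[1], &pages[0]);
		assert(PageTail(&pages[1]) && !PageTail(&pages[0]));
		assert(compound_head(&pages[1]) == &pages[0]);
		assert(compound_head(&pages[0]) == &pages[0]);
		return 0;
	}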

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/hugetlb_cgroup.h         |  4 +-
 include/linux/mm.h                     | 53 ++--------------------
 include/linux/mm_types.h               | 18 ++++++--
 include/linux/page-flags.h             | 80 ++++++++--------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 ---
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           |  8 +---
 mm/hugetlb_cgroup.c                    |  2 +-
 mm/internal.h                          |  4 +-
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 46 +++++++++++--------
 mm/swap.c                              |  4 +-
 15 files changed, 77 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
index 6dea4fd5c961..62842a857dab 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
 which must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
-allocation: slab uses page->slab_cache and page->first_page for its pages.
-These fields share storage with page->ptl.
+allocation: slab uses page->slab_cache for its pages.
+This field shares storage with page->ptl.
 
 PMD split lock only makes sense if you have more than two page table
 levels.
diff --git a/arch/xtensa/configs/iss_defconfig b/arch/xtensa/configs/iss_defconfig
index e4d193e7a300..5c7c385f21c4 100644
--- a/arch/xtensa/configs/iss_defconfig
+++ b/arch/xtensa/configs/iss_defconfig
@@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
 # CONFIG_SPARSEMEM_MANUAL is not set
 CONFIG_FLATMEM=y
 CONFIG_FLAT_NODE_MEM_MAP=y
-CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 # CONFIG_PHYS_ADDR_T_64BIT is not set
 CONFIG_ZONE_DMA_FLAG=1
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index bcc853eccc85..75e34b900748 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -32,7 +32,7 @@ static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
-	return (struct hugetlb_cgroup *)page[2].lru.next;
+	return (struct hugetlb_cgroup *)page[2].private;
 }
 
 static inline
@@ -42,7 +42,7 @@ int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
 
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return -1;
-	page[2].lru.next = (void *)h_cg;
+	page[2].private	= (unsigned long)h_cg;
 	return 0;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2cfe6051574c..9fc7dc8a49af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct page *page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-	struct page *head = tail->first_page;
-
-	/*
-	 * page->first_page may be a dangling pointer to an old
-	 * compound page, so recheck that it is still a tail
-	 * page before returning.
-	 */
-	smp_rmb();
-	if (likely(PageTail(tail)))
-		return head;
-	return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
-static inline struct page *compound_head(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return compound_head_by_tail(page);
-	return page;
-}
-
-/*
- * If we access compound page synchronously such as access to
- * allocated page, there is no need to handle tail flag race, so we can
- * check tail flag directly without any synchronization primitive.
- */
-static inline struct page *compound_head_fast(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return page->first_page;
-	return page;
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -525,7 +485,7 @@ static inline void get_huge_page_tail(struct page *page)
 	VM_BUG_ON_PAGE(!PageTail(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
+	if (compound_tail_refcounted(compound_head(page)))
 		atomic_inc(&page->_mapcount);
 }
 
@@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
 
-	/*
-	 * We don't need to worry about synchronization of tail flag
-	 * when we call virt_to_head_page() since it is only called for
-	 * already allocated page and this page won't be freed until
-	 * this virt_to_head_page() is finished. So use _fast variant.
-	 */
-	return compound_head_fast(page);
+	return compound_head(page);
 }
 
 /*
@@ -1496,8 +1450,7 @@ static inline bool ptlock_init(struct page *page)
 	 * with 0. Make sure nobody took it in use in between.
 	 *
 	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache and page->first_page (for tail
-	 * pages), which share storage with page->ptl.
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
 	 */
 	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
 	if (!ptlock_alloc(page))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 408a54563f65..ecaf3b1d0216 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,13 @@ struct page {
 		};
 	};
 
-	/* Third double word block */
+	/*
+	 * Third double word block
+	 *
+	 * WARNING: bit 0 of the first word encode PageTail(). That means
+	 * the rest users of the storage space MUST NOT use the bit to
+	 * avoid collision and false-positive PageTail().
+	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
@@ -143,12 +149,19 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
+			unsigned long compound_head; /* If bit zero is set */
 			unsigned short int compound_dtor;
 			unsigned short int compound_order;
 		};
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
-		pgtable_t pmd_huge_pte; /* protected by page->ptl */
+		struct {
+			unsigned long __pad;	/* do not overlay pmd_huge_pte
+						 * with compound_head to avoid
+						 * possible bit 0 collision.
+						 */
+			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+		};
 #endif
 	};
 
@@ -169,7 +182,6 @@ struct page {
 #endif
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
-		struct page *first_page;	/* Compound tail pages */
 	};
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 41c93844fb1d..9b865158e452 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -86,12 +86,7 @@ enum pageflags {
 	PG_private,		/* If pagecache, has fs-private data */
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	PG_head,		/* A head page */
-	PG_tail,		/* A tail page */
-#else
-	PG_compound,		/* A compound page */
-#endif
 	PG_swapcache,		/* Swap page: swp_entry_t in private */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
@@ -387,85 +382,46 @@ static inline void set_page_writeback_keepwrite(struct page *page)
 	test_set_page_writeback_keepwrite(page);
 }
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
-/*
- * System with lots of page flags available. This allows separate
- * flags for PageHead() and PageTail() checks of compound pages so that bit
- * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths except arch/powerpc/mm/init_64.c
- * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
- * and avoid handling those in real mode.
- */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
-__PAGEFLAG(Tail, tail)
 
-static inline int PageCompound(struct page *page)
-{
-	return page->flags & ((1L << PG_head) | (1L << PG_tail));
-
-}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline void ClearPageCompound(struct page *page)
+static inline int PageTail(struct page *page)
 {
-	BUG_ON(!PageHead(page));
-	ClearPageHead(page);
+	return READ_ONCE(page->compound_head) & 1;
 }
-#endif
-
-#define PG_head_mask ((1L << PG_head))
 
-#else
-/*
- * Reduce page flag use as much as possible by overlapping
- * compound page flags with the flags used for page cache pages. Possible
- * because PageCompound is always set for compound pages and not for
- * pages on the LRU and/or pagecache.
- */
-TESTPAGEFLAG(Compound, compound)
-__SETPAGEFLAG(Head, compound)  __CLEARPAGEFLAG(Head, compound)
-
-/*
- * PG_reclaim is used in combination with PG_compound to mark the
- * head and tail of a compound page. This saves one page flag
- * but makes it impossible to use compound pages for the page cache.
- * The PG_reclaim bit would have to be used for reclaim or readahead
- * if compound pages enter the page cache.
- *
- * PG_compound & PG_reclaim	=> Tail page
- * PG_compound & ~PG_reclaim	=> Head page
- */
-#define PG_head_mask ((1L << PG_compound))
-#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
-
-static inline int PageHead(struct page *page)
+static inline void set_compound_head(struct page *page, struct page *head)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_mask);
+	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
 }
 
-static inline int PageTail(struct page *page)
+static inline void clear_compound_head(struct page *page)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
+	WRITE_ONCE(page->compound_head, 0);
 }
 
-static inline void __SetPageTail(struct page *page)
+static inline struct page *compound_head(struct page *page)
 {
-	page->flags |= PG_head_tail_mask;
+	unsigned long head = READ_ONCE(page->compound_head);
+
+	if (unlikely(head & 1))
+		return (struct page *) (head - 1);
+	return page;
 }
 
-static inline void __ClearPageTail(struct page *page)
+static inline int PageCompound(struct page *page)
 {
-	page->flags &= ~PG_head_tail_mask;
-}
+	return PageHead(page) || PageTail(page);
 
+}
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void ClearPageCompound(struct page *page)
 {
-	BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
-	clear_bit(PG_compound, &page->flags);
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
 }
 #endif
 
-#endif /* !PAGEFLAGS_EXTENDED */
+#define PG_head_mask ((1L << PG_head))
 
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..454579d31081 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,18 +200,6 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
-#
-# If we have space for more page flags then we can enable additional
-# optimizations and functionality.
-#
-# Regular Sparsemem takes page flag bits for the sectionid if it does not
-# use a virtual memmap. Disable extended page flags for 32 bit platforms
-# that require the use of a sectionid in the page flags.
-#
-config PAGEFLAGS_EXTENDED
-	def_bool y
-	depends on 64BIT || SPARSEMEM_VMEMMAP || !SPARSEMEM
-
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..205e5ef957ab 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,12 +25,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private,		"private"	},
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{1UL << PG_head,		"head"		},
-	{1UL << PG_tail,		"tail"		},
-#else
-	{1UL << PG_compound,		"compound"	},
-#endif
 	{1UL << PG_swapcache,		"swapcache"	},
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},
 	{1UL << PG_reclaim,		"reclaim"	},
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4bfbd9..330377f83ac7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
 				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/* clear PageTail before overwriting first_page */
-		smp_wmb();
+		clear_compound_head(page_tail);
 
 		/*
 		 * __split_huge_page_splitting() already set the
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ea74caa1fa8..53c0709fd87b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -824,9 +824,8 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__ClearPageTail(p);
+		clear_compound_head(p);
 		set_page_refcounted(p);
-		p->first_page = NULL;
 	}
 
 	set_compound_order(page, 0);
@@ -1099,10 +1098,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 		 */
 		__ClearPageReserved(p);
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 6e0057439a46..6a4426372698 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -384,7 +384,7 @@ void __init hugetlb_cgroup_file_init(void)
 		/*
 		 * Add cgroup control files only if the huge page consists
 		 * of more than two normal pages. This is because we use
-		 * page[2].lru.next for storing cgroup details.
+		 * page[2].private for storing cgroup details.
 		 */
 		if (huge_page_order(h) >= HUGETLB_CGROUP_MIN_ORDER)
 			__hugetlb_cgroup_file_init(hstate_index(h));
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..89e21a07080a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * speculative page access (like in
 	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
+	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
 	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
+		atomic_inc(&compound_head(page)->_count);
 	get_huge_page_tail(page);
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1f4446a90cef..4d1a5de9653d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define lru		(1UL << PG_lru)
 #define swapbacked	(1UL << PG_swapbacked)
 #define head		(1UL << PG_head)
-#define tail		(1UL << PG_tail)
-#define compound	(1UL << PG_compound)
 #define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
@@ -811,12 +809,7 @@ static struct page_state {
 	 */
 	{ slab,		slab,		MF_MSG_SLAB,	me_kernel },
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		MF_MSG_HUGE,		me_huge_page },
-	{ tail,		tail,		MF_MSG_HUGE,		me_huge_page },
-#else
-	{ compound,	compound,	MF_MSG_HUGE,		me_huge_page },
-#endif
 
 	{ sc|dirty,	sc|dirty,	MF_MSG_DIRTY_SWAPCACHE,	me_swapcache_dirty },
 	{ sc|dirty,	sc,		MF_MSG_CLEAN_SWAPCACHE,	me_swapcache_clean },
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6733cc3cbce..a56ad53ff553 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -424,15 +424,15 @@ out:
 /*
  * Higher-order pages are called "compound pages".  They are structured thusly:
  *
- * The first PAGE_SIZE page is called the "head page".
+ * The first PAGE_SIZE page is called the "head page" and have PG_head set.
  *
- * The remaining PAGE_SIZE pages are called "tail pages".
+ * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
+ * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
  *
- * All pages have PG_compound set.  All tail pages have their ->first_page
- * pointing at the head page.
+ * The first tail page's ->compound_dtor holds the offset in array of compound
+ * page destructors. See compound_page_dtors.
  *
- * The first tail page's ->lru.next holds the address of the compound page's
- * put_page() function.  Its ->lru.prev holds the order of allocation.
+ * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
  */
 
@@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
@@ -830,17 +827,30 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (!IS_ENABLED(CONFIG_DEBUG_VM))
-		return 0;
+	int ret = 1;
+
+	/*
+	 * We rely page->lru.next never has bit 0 set, unless the page
+	 * is PageTail(). Let's make sure that's true even for poisoned ->lru.
+	 */
+	BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
+
+	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
+		ret = 0;
+		goto out;
+	}
 	if (unlikely(!PageTail(page))) {
 		bad_page(page, "PageTail not set", 0);
-		return 1;
+		goto out;
 	}
-	if (unlikely(page->first_page != head_page)) {
-		bad_page(page, "first_page not consistent", 0);
-		return 1;
+	if (unlikely(compound_head(page) != head_page)) {
+		bad_page(page, "compound_head not consistent", 0);
+		goto out;
 	}
-	return 0;
+	ret = 0;
+out:
+	clear_compound_head(page);
+	return ret;
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -888,6 +898,8 @@ static void init_reserved_page(unsigned long pfn)
 #else
 static inline void init_reserved_page(unsigned long pfn)
 {
+	/* Avoid false-positive PageTail() */
+	INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
diff --git a/mm/swap.c b/mm/swap.c
index a3a0a2f1f7c3..faa9e1687dea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,7 +200,7 @@ out_put_single:
 				__put_single_page(page);
 			return;
 		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
+		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
 		/*
 		 * We can release the refcount taken by
 		 * get_page_unless_zero() now that
@@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
 	 *  Case 3 is possible, as we may race with
 	 *  __split_huge_page_refcount tearing down a THP page.
 	 */
-	page_head = compound_head_by_tail(page);
+	page_head = compound_head(page);
 	if (!__compound_tail_refcounted(page_head))
 		put_unrefcounted_compound_page(page_head, page);
 	else
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCHv5 5/7] mm: make compound_head() robust
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  0 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov, Paul E. McKenney

Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

	CPU0					CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
					put_page()
					  tail->first_page = NULL
      head = tail->first_page
					alloc_pages(__GFP_COMP)
					   prep_compound_page()
					     tail->first_page = head
					     __SetPageTail(p);
      !!PageTail() == true
    <head == NULL dereferencing>

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.

The patch moves page->pmd_huge_pte out of this word, just in case an
architecture defines pgtable_t as something that can have bit 0 set.

hugetlb_cgroup uses page->lru.next in the second tail page to store a
pointer to struct hugetlb_cgroup. The patch switches it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.

The patch also opens up the possibility of removing the
HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
tail page to store the struct hugetlb_cgroup pointer. But that's out of
scope for this patch.

That means page->compound_head shares storage space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;

That's too long a list to be absolutely sure, but it looks like nobody uses
bit 0 of the word.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig      |  1 -
 include/linux/hugetlb_cgroup.h         |  4 +-
 include/linux/mm.h                     | 53 ++--------------------
 include/linux/mm_types.h               | 18 ++++++--
 include/linux/page-flags.h             | 80 ++++++++--------------------------
 mm/Kconfig                             | 12 -----
 mm/debug.c                             |  5 ---
 mm/huge_memory.c                       |  3 +-
 mm/hugetlb.c                           |  8 +---
 mm/hugetlb_cgroup.c                    |  2 +-
 mm/internal.h                          |  4 +-
 mm/memory-failure.c                    |  7 ---
 mm/page_alloc.c                        | 46 +++++++++++--------
 mm/swap.c                              |  4 +-
 15 files changed, 77 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
index 6dea4fd5c961..62842a857dab 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
 which must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
-allocation: slab uses page->slab_cache and page->first_page for its pages.
-These fields share storage with page->ptl.
+allocation: slab uses page->slab_cache for its pages.
+This field shares storage with page->ptl.
 
 PMD split lock only makes sense if you have more than two page table
 levels.
diff --git a/arch/xtensa/configs/iss_defconfig b/arch/xtensa/configs/iss_defconfig
index e4d193e7a300..5c7c385f21c4 100644
--- a/arch/xtensa/configs/iss_defconfig
+++ b/arch/xtensa/configs/iss_defconfig
@@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
 # CONFIG_SPARSEMEM_MANUAL is not set
 CONFIG_FLATMEM=y
 CONFIG_FLAT_NODE_MEM_MAP=y
-CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 # CONFIG_PHYS_ADDR_T_64BIT is not set
 CONFIG_ZONE_DMA_FLAG=1
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index bcc853eccc85..75e34b900748 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -32,7 +32,7 @@ static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
-	return (struct hugetlb_cgroup *)page[2].lru.next;
+	return (struct hugetlb_cgroup *)page[2].private;
 }
 
 static inline
@@ -42,7 +42,7 @@ int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
 
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return -1;
-	page[2].lru.next = (void *)h_cg;
+	page[2].private	= (unsigned long)h_cg;
 	return 0;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2cfe6051574c..9fc7dc8a49af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct page *page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-	struct page *head = tail->first_page;
-
-	/*
-	 * page->first_page may be a dangling pointer to an old
-	 * compound page, so recheck that it is still a tail
-	 * page before returning.
-	 */
-	smp_rmb();
-	if (likely(PageTail(tail)))
-		return head;
-	return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
-static inline struct page *compound_head(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return compound_head_by_tail(page);
-	return page;
-}
-
-/*
- * If we access compound page synchronously such as access to
- * allocated page, there is no need to handle tail flag race, so we can
- * check tail flag directly without any synchronization primitive.
- */
-static inline struct page *compound_head_fast(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		return page->first_page;
-	return page;
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -525,7 +485,7 @@ static inline void get_huge_page_tail(struct page *page)
 	VM_BUG_ON_PAGE(!PageTail(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
+	if (compound_tail_refcounted(compound_head(page)))
 		atomic_inc(&page->_mapcount);
 }
 
@@ -548,13 +508,7 @@ static inline struct page *virt_to_head_page(const void *x)
 {
 	struct page *page = virt_to_page(x);
 
-	/*
-	 * We don't need to worry about synchronization of tail flag
-	 * when we call virt_to_head_page() since it is only called for
-	 * already allocated page and this page won't be freed until
-	 * this virt_to_head_page() is finished. So use _fast variant.
-	 */
-	return compound_head_fast(page);
+	return compound_head(page);
 }
 
 /*
@@ -1496,8 +1450,7 @@ static inline bool ptlock_init(struct page *page)
 	 * with 0. Make sure nobody took it in use in between.
 	 *
 	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache and page->first_page (for tail
-	 * pages), which share storage with page->ptl.
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
 	 */
 	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
 	if (!ptlock_alloc(page))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 408a54563f65..ecaf3b1d0216 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,13 @@ struct page {
 		};
 	};
 
-	/* Third double word block */
+	/*
+	 * Third double word block
+	 *
+	 * WARNING: bit 0 of the first word encode PageTail(). That means
+	 * the rest users of the storage space MUST NOT use the bit to
+	 * avoid collision and false-positive PageTail().
+	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
@@ -143,12 +149,19 @@ struct page {
 						 */
 		/* First tail page of compound page */
 		struct {
+			unsigned long compound_head; /* If bit zero is set */
 			unsigned short int compound_dtor;
 			unsigned short int compound_order;
 		};
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
-		pgtable_t pmd_huge_pte; /* protected by page->ptl */
+		struct {
+			unsigned long __pad;	/* do not overlay pmd_huge_pte
+						 * with compound_head to avoid
+						 * possible bit 0 collision.
+						 */
+			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+		};
 #endif
 	};
 
@@ -169,7 +182,6 @@ struct page {
 #endif
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
-		struct page *first_page;	/* Compound tail pages */
 	};
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 41c93844fb1d..9b865158e452 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -86,12 +86,7 @@ enum pageflags {
 	PG_private,		/* If pagecache, has fs-private data */
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	PG_head,		/* A head page */
-	PG_tail,		/* A tail page */
-#else
-	PG_compound,		/* A compound page */
-#endif
 	PG_swapcache,		/* Swap page: swp_entry_t in private */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
@@ -387,85 +382,46 @@ static inline void set_page_writeback_keepwrite(struct page *page)
 	test_set_page_writeback_keepwrite(page);
 }
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
-/*
- * System with lots of page flags available. This allows separate
- * flags for PageHead() and PageTail() checks of compound pages so that bit
- * tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths except arch/powerpc/mm/init_64.c
- * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
- * and avoid handling those in real mode.
- */
 __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
-__PAGEFLAG(Tail, tail)
 
-static inline int PageCompound(struct page *page)
-{
-	return page->flags & ((1L << PG_head) | (1L << PG_tail));
-
-}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline void ClearPageCompound(struct page *page)
+static inline int PageTail(struct page *page)
 {
-	BUG_ON(!PageHead(page));
-	ClearPageHead(page);
+	return READ_ONCE(page->compound_head) & 1;
 }
-#endif
-
-#define PG_head_mask ((1L << PG_head))
 
-#else
-/*
- * Reduce page flag use as much as possible by overlapping
- * compound page flags with the flags used for page cache pages. Possible
- * because PageCompound is always set for compound pages and not for
- * pages on the LRU and/or pagecache.
- */
-TESTPAGEFLAG(Compound, compound)
-__SETPAGEFLAG(Head, compound)  __CLEARPAGEFLAG(Head, compound)
-
-/*
- * PG_reclaim is used in combination with PG_compound to mark the
- * head and tail of a compound page. This saves one page flag
- * but makes it impossible to use compound pages for the page cache.
- * The PG_reclaim bit would have to be used for reclaim or readahead
- * if compound pages enter the page cache.
- *
- * PG_compound & PG_reclaim	=> Tail page
- * PG_compound & ~PG_reclaim	=> Head page
- */
-#define PG_head_mask ((1L << PG_compound))
-#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
-
-static inline int PageHead(struct page *page)
+static inline void set_compound_head(struct page *page, struct page *head)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_mask);
+	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
 }
 
-static inline int PageTail(struct page *page)
+static inline void clear_compound_head(struct page *page)
 {
-	return ((page->flags & PG_head_tail_mask) == PG_head_tail_mask);
+	WRITE_ONCE(page->compound_head, 0);
 }
 
-static inline void __SetPageTail(struct page *page)
+static inline struct page *compound_head(struct page *page)
 {
-	page->flags |= PG_head_tail_mask;
+	unsigned long head = READ_ONCE(page->compound_head);
+
+	if (unlikely(head & 1))
+		return (struct page *) (head - 1);
+	return page;
 }
 
-static inline void __ClearPageTail(struct page *page)
+static inline int PageCompound(struct page *page)
 {
-	page->flags &= ~PG_head_tail_mask;
-}
+	return PageHead(page) || PageTail(page);
 
+}
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void ClearPageCompound(struct page *page)
 {
-	BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
-	clear_bit(PG_compound, &page->flags);
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
 }
 #endif
 
-#endif /* !PAGEFLAGS_EXTENDED */
+#define PG_head_mask ((1L << PG_head))
 
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..454579d31081 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,18 +200,6 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
-#
-# If we have space for more page flags then we can enable additional
-# optimizations and functionality.
-#
-# Regular Sparsemem takes page flag bits for the sectionid if it does not
-# use a virtual memmap. Disable extended page flags for 32 bit platforms
-# that require the use of a sectionid in the page flags.
-#
-config PAGEFLAGS_EXTENDED
-	def_bool y
-	depends on 64BIT || SPARSEMEM_VMEMMAP || !SPARSEMEM
-
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/debug.c b/mm/debug.c
index 76089ddf99ea..205e5ef957ab 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,12 +25,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private,		"private"	},
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{1UL << PG_head,		"head"		},
-	{1UL << PG_tail,		"tail"		},
-#else
-	{1UL << PG_compound,		"compound"	},
-#endif
 	{1UL << PG_swapcache,		"swapcache"	},
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},
 	{1UL << PG_reclaim,		"reclaim"	},
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4bfbd9..330377f83ac7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
 				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
-		/* clear PageTail before overwriting first_page */
-		smp_wmb();
+		clear_compound_head(page_tail);
 
 		/*
 		 * __split_huge_page_splitting() already set the
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ea74caa1fa8..53c0709fd87b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -824,9 +824,8 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
-		__ClearPageTail(p);
+		clear_compound_head(p);
 		set_page_refcounted(p);
-		p->first_page = NULL;
 	}
 
 	set_compound_order(page, 0);
@@ -1099,10 +1098,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned long order)
 		 */
 		__ClearPageReserved(p);
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 6e0057439a46..6a4426372698 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -384,7 +384,7 @@ void __init hugetlb_cgroup_file_init(void)
 		/*
 		 * Add cgroup control files only if the huge page consists
 		 * of more than two normal pages. This is because we use
-		 * page[2].lru.next for storing cgroup details.
+		 * page[2].private for storing cgroup details.
 		 */
 		if (huge_page_order(h) >= HUGETLB_CGROUP_MIN_ORDER)
 			__hugetlb_cgroup_file_init(hstate_index(h));
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..89e21a07080a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
 	 * speculative page access (like in
 	 * page_cache_get_speculative()) on tail pages.
 	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
+	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
 	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
+		atomic_inc(&compound_head(page)->_count);
 	get_huge_page_tail(page);
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1f4446a90cef..4d1a5de9653d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define lru		(1UL << PG_lru)
 #define swapbacked	(1UL << PG_swapbacked)
 #define head		(1UL << PG_head)
-#define tail		(1UL << PG_tail)
-#define compound	(1UL << PG_compound)
 #define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
@@ -811,12 +809,7 @@ static struct page_state {
 	 */
 	{ slab,		slab,		MF_MSG_SLAB,	me_kernel },
 
-#ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		MF_MSG_HUGE,		me_huge_page },
-	{ tail,		tail,		MF_MSG_HUGE,		me_huge_page },
-#else
-	{ compound,	compound,	MF_MSG_HUGE,		me_huge_page },
-#endif
 
 	{ sc|dirty,	sc|dirty,	MF_MSG_DIRTY_SWAPCACHE,	me_swapcache_dirty },
 	{ sc|dirty,	sc,		MF_MSG_CLEAN_SWAPCACHE,	me_swapcache_clean },
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6733cc3cbce..a56ad53ff553 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -424,15 +424,15 @@ out:
 /*
  * Higher-order pages are called "compound pages".  They are structured thusly:
  *
- * The first PAGE_SIZE page is called the "head page".
+ * The first PAGE_SIZE page is called the "head page" and have PG_head set.
  *
- * The remaining PAGE_SIZE pages are called "tail pages".
+ * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
+ * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
  *
- * All pages have PG_compound set.  All tail pages have their ->first_page
- * pointing at the head page.
+ * The first tail page's ->compound_dtor holds the offset in array of compound
+ * page destructors. See compound_page_dtors.
  *
- * The first tail page's ->lru.next holds the address of the compound page's
- * put_page() function.  Its ->lru.prev holds the order of allocation.
+ * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
  */
 
@@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
 		set_page_count(p, 0);
-		p->first_page = page;
-		/* Make sure p->first_page is always valid for PageTail() */
-		smp_wmb();
-		__SetPageTail(p);
+		set_compound_head(p, page);
 	}
 }
 
@@ -830,17 +827,30 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (!IS_ENABLED(CONFIG_DEBUG_VM))
-		return 0;
+	int ret = 1;
+
+	/*
+	 * We rely page->lru.next never has bit 0 set, unless the page
+	 * is PageTail(). Let's make sure that's true even for poisoned ->lru.
+	 */
+	BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
+
+	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
+		ret = 0;
+		goto out;
+	}
 	if (unlikely(!PageTail(page))) {
 		bad_page(page, "PageTail not set", 0);
-		return 1;
+		goto out;
 	}
-	if (unlikely(page->first_page != head_page)) {
-		bad_page(page, "first_page not consistent", 0);
-		return 1;
+	if (unlikely(compound_head(page) != head_page)) {
+		bad_page(page, "compound_head not consistent", 0);
+		goto out;
 	}
-	return 0;
+	ret = 0;
+out:
+	clear_compound_head(page);
+	return ret;
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -888,6 +898,8 @@ static void init_reserved_page(unsigned long pfn)
 #else
 static inline void init_reserved_page(unsigned long pfn)
 {
+	/* Avoid false-positive PageTail() */
+	INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
diff --git a/mm/swap.c b/mm/swap.c
index a3a0a2f1f7c3..faa9e1687dea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,7 +200,7 @@ out_put_single:
 				__put_single_page(page);
 			return;
 		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
+		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
 		/*
 		 * We can release the refcount taken by
 		 * get_page_unless_zero() now that
@@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
 	 *  Case 3 is possible, as we may race with
 	 *  __split_huge_page_refcount tearing down a THP page.
 	 */
-	page_head = compound_head_by_tail(page);
+	page_head = compound_head(page);
 	if (!__compound_tail_refcounted(page_head))
 		put_unrefcounted_compound_page(page_head, page);
 	else
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCHv5 6/7] mm: use 'unsigned int' for page order
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov

Let's try to be consistent about the data type of page order.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |  5 +++--
 mm/hugetlb.c       | 19 ++++++++++---------
 mm/internal.h      |  4 ++--
 mm/page_alloc.c    | 27 +++++++++++++++------------
 4 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9fc7dc8a49af..00a78439f392 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -557,7 +557,7 @@ static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
 	return compound_page_dtors[page[1].compound_dtor];
 }
 
-static inline int compound_order(struct page *page)
+static inline unsigned int compound_order(struct page *page)
 {
 	if (!PageHead(page))
 		return 0;
@@ -1718,7 +1718,8 @@ extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
 extern __printf(3, 4)
-void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
+void warn_alloc_failed(gfp_t gfp_mask, unsigned int order,
+		const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 53c0709fd87b..bf64bfebc473 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -817,7 +817,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 
 #if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
 static void destroy_compound_gigantic_page(struct page *page,
-					unsigned long order)
+					unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -832,7 +832,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned order)
+static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	free_contig_range(page_to_pfn(page), 1 << order);
 }
@@ -876,7 +876,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
 	return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(int nid, unsigned order)
+static struct page *alloc_gigantic_page(int nid, unsigned int order)
 {
 	unsigned long nr_pages = 1 << order;
 	unsigned long ret, pfn, flags;
@@ -912,7 +912,7 @@ static struct page *alloc_gigantic_page(int nid, unsigned order)
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
-static void prep_compound_gigantic_page(struct page *page, unsigned long order);
+static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 
 static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
 {
@@ -945,9 +945,9 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
 static inline bool gigantic_page_supported(void) { return true; }
 #else
 static inline bool gigantic_page_supported(void) { return false; }
-static inline void free_gigantic_page(struct page *page, unsigned order) { }
+static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
-						unsigned long order) { }
+						unsigned int order) { }
 static inline int alloc_fresh_gigantic_page(struct hstate *h,
 					nodemask_t *nodes_allowed) { return 0; }
 #endif
@@ -1073,7 +1073,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	put_page(page); /* free it into the hugepage allocator */
 }
 
-static void prep_compound_gigantic_page(struct page *page, unsigned long order)
+static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1640,7 +1640,8 @@ found:
 	return 1;
 }
 
-static void __init prep_compound_huge_page(struct page *page, int order)
+static void __init prep_compound_huge_page(struct page *page,
+		unsigned int order)
 {
 	if (unlikely(order > (MAX_ORDER - 1)))
 		prep_compound_gigantic_page(page, order);
@@ -2351,7 +2352,7 @@ static int __init hugetlb_init(void)
 module_init(hugetlb_init);
 
 /* Should be called on processing a hugepagesz=... option */
-void __init hugetlb_add_hstate(unsigned order)
+void __init hugetlb_add_hstate(unsigned int order)
 {
 	struct hstate *h;
 	unsigned long i;
diff --git a/mm/internal.h b/mm/internal.h
index 89e21a07080a..9a9fc497593f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,7 +157,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
 extern int __isolate_free_page(struct page *page, unsigned int order);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
 					unsigned int order);
-extern void prep_compound_page(struct page *page, unsigned long order);
+extern void prep_compound_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -214,7 +214,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
  * page cannot be allocated or merged in parallel. Alternatively, it must
  * handle invalid values gracefully, and use page_order_unsafe() below.
  */
-static inline unsigned long page_order(struct page *page)
+static inline unsigned int page_order(struct page *page)
 {
 	/* PageBuddy() must be checked by the caller */
 	return page_private(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a56ad53ff553..81cfaca6b9fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -163,7 +163,7 @@ bool pm_suspended_storage(void)
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
-int pageblock_order __read_mostly;
+unsigned int pageblock_order __read_mostly;
 #endif
 
 static void __free_pages_ok(struct page *page, unsigned int order);
@@ -441,7 +441,7 @@ static void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page));
 }
 
-void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -641,7 +641,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
-	int max_order = MAX_ORDER;
+	unsigned int max_order = MAX_ORDER;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1444,7 +1444,7 @@ int move_freepages(struct zone *zone,
 			  int migratetype)
 {
 	struct page *page;
-	unsigned long order;
+	unsigned int order;
 	int pages_moved = 0;
 
 #ifndef CONFIG_HOLES_IN_ZONE
@@ -1558,7 +1558,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
 							  int start_type)
 {
-	int current_order = page_order(page);
+	unsigned int current_order = page_order(page);
 	int pages;
 
 	/* Take ownership for orders >= pageblock_order */
@@ -2665,7 +2665,7 @@ static DEFINE_RATELIMIT_STATE(nopage_rs,
 		DEFAULT_RATELIMIT_INTERVAL,
 		DEFAULT_RATELIMIT_BURST);
 
-void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
+void warn_alloc_failed(gfp_t gfp_mask, unsigned int order, const char *fmt, ...)
 {
 	unsigned int filter = SHOW_MEM_FILTER_NODES;
 
@@ -2699,7 +2699,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		va_end(args);
 	}
 
-	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
+	pr_warn("%s: page allocation failure: order:%u, mode:0x%x\n",
 		current->comm, order, gfp_mask);
 
 	dump_stack();
@@ -3458,7 +3458,8 @@ void free_kmem_pages(unsigned long addr, unsigned int order)
 	}
 }
 
-static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
+static void *make_alloc_exact(unsigned long addr, unsigned int order,
+		size_t size)
 {
 	if (addr) {
 		unsigned long alloc_end = addr + (PAGE_SIZE << order);
@@ -3510,7 +3511,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
  */
 void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
 {
-	unsigned order = get_order(size);
+	unsigned int order = get_order(size);
 	struct page *p = alloc_pages_node(nid, gfp_mask, order);
 	if (!p)
 		return NULL;
@@ -3812,7 +3813,8 @@ void show_free_areas(unsigned int filter)
 	}
 
 	for_each_populated_zone(zone) {
-		unsigned long nr[MAX_ORDER], flags, order, total = 0;
+		unsigned int order;
+		unsigned long nr[MAX_ORDER], flags, total = 0;
 		unsigned char types[MAX_ORDER];
 
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -4161,7 +4163,7 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	int order = current_zonelist_order;
+	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -6826,7 +6828,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype)
 {
 	unsigned long outer_start, outer_end;
-	int ret = 0, order;
+	unsigned int order;
+	int ret = 0;
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCHv5 7/7] mm: use 'unsigned int' for compound_dtor/compound_order on 64BIT
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-03 12:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-03 12:35 UTC (permalink / raw)
  To: Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Kirill A. Shutemov

On 64-bit systems we have enough space in struct page to encode
compound_dtor and compound_order as unsigned int.

On x86-64 this leads to slightly smaller code size due to the use of plain
MOV instead of MOVZX (zero-extended move), or a similar effect.

allyesconfig:

   text	   data	    bss	    dec	    hex	filename
159520446	48146736	72196096	279863278	10ae5fee	vmlinux.pre
159520382	48146736	72196096	279863214	10ae5fae	vmlinux.post

On other architectures without native support for 16-bit data types the
difference can be bigger.
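
A rough way to see the difference outside the kernel (hypothetical structs,
not the real struct page, and the exact codegen is compiler-dependent):

        struct p16 { unsigned long head; unsigned short dtor, order; };
        struct p32 { unsigned long head; unsigned int dtor, order; };

        /* typically movzwl on x86-64: the 16-bit load must be zero-extended */
        unsigned int order16(const struct p16 *p) { return p->order; }

        /* typically a plain movl: 32-bit loads zero-extend for free */
        unsigned int order32(const struct p32 *p) { return p->order; }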

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm_types.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ecaf3b1d0216..39b0db74ba5e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -150,8 +150,13 @@ struct page {
 		/* First tail page of compound page */
 		struct {
 			unsigned long compound_head; /* If bit zero is set */
+#ifdef CONFIG_64BIT
+			unsigned int compound_dtor;
+			unsigned int compound_order;
+#else
 			unsigned short int compound_dtor;
 			unsigned short int compound_order;
+#endif
 		};
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 0/7] Fix compound_head() race
  2015-09-03 12:35 ` Kirill A. Shutemov
@ 2015-09-04 13:43   ` Andrea Arcangeli
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2015-09-04 13:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Hugh Dickins, Dave Hansen, Vlastimil Babka,
	Johannes Weiner, Michal Hocko, David Rientjes, Aneesh Kumar K.V,
	linux-kernel, linux-mm

On Thu, Sep 03, 2015 at 03:35:51PM +0300, Kirill A. Shutemov wrote:
> Kirill A. Shutemov (7):
>   mm: drop page->slab_page
>   slub: use page->rcu_head instead of page->lru plus cast
>   zsmalloc: use page->private instead of page->first_page
>   mm: pack compound_dtor and compound_order into one word in struct page
>   mm: make compound_head() robust
>   mm: use 'unsigned int' for page order
>   mm: use 'unsigned int' for compound_dtor/compound_order on 64BIT

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

The only other alternative solution, one that doesn't require finding a
field unused in tail pages whose bit 0 is guaranteed to be zero, is to drop
both PG_head and PG_tail and reserve 4 bits of page->flags.

This means a net loss of 2 bits from page->flags (and a loss of 3 bits
if !CONFIG_PAGEFLAGS_EXTENDED), but then everything becomes simple and
there's no need to find a field whose LSB is guaranteed to be zero at all
times.

Those 4 bits are cleared for non-compound pages. When you create a
compound page you encode the compound_order in those 4 bits of page->flags,
the same for all head and tail pages. compound_order() then becomes
atomically available for tail pages too, and compound_order goes away from
struct page along with first_page (and there's no need to add a
compound_head).

In PageCompound you read the 4 bits: if they're not all zero it's
compound, otherwise it's not.

In PageHead/PageTail, if the 4 bits are all zero it's not a head/tail;
otherwise you do the math on page_to_pfn(page). If the pfn is naturally
aligned to the order encoded in the 4 bits, i.e. "!(pfn & ((1<<order)-1))",
it's a head, otherwise it's a tail.

If it's a tail, then compound_head is just a matter of doing
"return page - (pfn & ((1<<order)-1))" (no need for pfn_to_page).

This leverages the natural physical alignment of compound pages for
all orders. It would cover up to CONFIG_FORCE_MAX_ZONEORDER == 16
(128MB at order 15 with a 4KB PAGE_SIZE).
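
To make the idea concrete, here is a hypothetical userspace model of the
pfn-alignment trick (flags_order[] stands in for the 4 bits of page->flags;
the names and layout are made up for illustration, this is not a real
patch):

        #include <assert.h>

        #define NPAGES 32
        static unsigned int flags_order[NPAGES]; /* 0 means "not compound" */

        static int page_is_head(unsigned long pfn)
        {
                unsigned int order = flags_order[pfn];

                return order && !(pfn & ((1UL << order) - 1));
        }

        static int page_is_tail(unsigned long pfn)
        {
                unsigned int order = flags_order[pfn];

                return order && (pfn & ((1UL << order) - 1));
        }

        static unsigned long compound_head_pfn(unsigned long pfn)
        {
                return pfn - (pfn & ((1UL << flags_order[pfn]) - 1));
        }

        int main(void)
        {
                unsigned long pfn;

                /* "allocate" an order-2 compound page at pfn 8 */
                for (pfn = 8; pfn < 12; pfn++)
                        flags_order[pfn] = 2;

                assert(page_is_head(8) && !page_is_tail(8));
                assert(page_is_tail(9) && compound_head_pfn(11) == 8);
                assert(!page_is_head(0) && !page_is_tail(0));
                return 0;
        }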

page_to_pfn can actually be replaced with
(&NODE_DATA(page_to_nid(page))->node_mem_map-page), which is faster as
page_to_nid only needs to access page->flags, which is already in L1. So
then it costs only one cacheline access in the pgdat and a sub.

Because of the two (or three) additional bits taken out of page->flags
I doubt it's viable on 32bit, but I thought I'd mention it just in case.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 1/7] mm: drop page->slab_page
  2015-09-03 12:35   ` Kirill A. Shutemov
@ 2015-09-07  5:00     ` Alexander Duyck
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2015-09-07  5:00 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Joonsoo Kim, Andi Kleen

On 09/03/2015 05:35 AM, Kirill A. Shutemov wrote:
> Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> uses slab_page field in struct page.
>
> Let's drop it.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> ---
>   include/linux/mm_types.h |  1 -
>   mm/slab.c                | 17 +++--------------
>   2 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0038ac7466fd..58620ac7f15c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -140,7 +140,6 @@ struct page {
>   #endif
>   		};
>   
> -		struct slab *slab_page; /* slab fields */
>   		struct rcu_head rcu_head;	/* Used by SLAB
>   						 * when destroying via RCU
>   						 */
> diff --git a/mm/slab.c b/mm/slab.c
> index 200e22412a16..649044f26e5d 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1888,21 +1888,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
>   
>   	freelist = page->freelist;
>   	slab_destroy_debugcheck(cachep, page);
> -	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
> -		struct rcu_head *head;
> -
> -		/*
> -		 * RCU free overloads the RCU head over the LRU.
> -		 * slab_page has been overloeaded over the LRU,
> -		 * however it is not used from now on so that
> -		 * we can use it safely.
> -		 */
> -		head = (void *)&page->rcu_head;
> -		call_rcu(head, kmem_rcu_free);
> -
> -	} else {
> +	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
> +		call_rcu(&page->rcu_head, kmem_rcu_free);
> +	else
>   		kmem_freepages(cachep, page);
> -	}
>   
>   	/*
>   	 * From now on, we don't use freelist

This second piece looks like it belongs in patch 2, not patch 1, based on
the descriptions.

- Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 1/7] mm: drop page->slab_page
  2015-09-07  5:00     ` Alexander Duyck
@ 2015-09-07 12:19       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-07 12:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, David Rientjes, Aneesh Kumar K.V, linux-kernel,
	linux-mm, Joonsoo Kim, Andi Kleen

On Sun, Sep 06, 2015 at 10:00:57PM -0700, Alexander Duyck wrote:
> On 09/03/2015 05:35 AM, Kirill A. Shutemov wrote:
> >Since 8456a648cf44 ("slab: use struct page for slab management") nobody
> >uses slab_page field in struct page.
> >
> >Let's drop it.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Acked-by: Christoph Lameter <cl@linux.com>
> >Acked-by: David Rientjes <rientjes@google.com>
> >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >Cc: Andi Kleen <ak@linux.intel.com>
> >---
> >  include/linux/mm_types.h |  1 -
> >  mm/slab.c                | 17 +++--------------
> >  2 files changed, 3 insertions(+), 15 deletions(-)
> >
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 0038ac7466fd..58620ac7f15c 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -140,7 +140,6 @@ struct page {
> >  #endif
> >  		};
> >-		struct slab *slab_page; /* slab fields */
> >  		struct rcu_head rcu_head;	/* Used by SLAB
> >  						 * when destroying via RCU
> >  						 */
> >diff --git a/mm/slab.c b/mm/slab.c
> >index 200e22412a16..649044f26e5d 100644
> >--- a/mm/slab.c
> >+++ b/mm/slab.c
> >@@ -1888,21 +1888,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
> >  	freelist = page->freelist;
> >  	slab_destroy_debugcheck(cachep, page);
> >-	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
> >-		struct rcu_head *head;
> >-
> >-		/*
> >-		 * RCU free overloads the RCU head over the LRU.
> >-		 * slab_page has been overloeaded over the LRU,
> >-		 * however it is not used from now on so that
> >-		 * we can use it safely.
> >-		 */
> >-		head = (void *)&page->rcu_head;
> >-		call_rcu(head, kmem_rcu_free);
> >-
> >-	} else {
> >+	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
> >+		call_rcu(&page->rcu_head, kmem_rcu_free);
> >+	else
> >  		kmem_freepages(cachep, page);
> >-	}
> >  	/*
> >  	 * From now on, we don't use freelist
> 
> This second piece looks like it belongs in patch 2, not patch 1 based on the
> descriptions.

You're right.

Although I don't think I would re-spin the patchset just for this change.
If any other change is required, I'll fix this one too.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 4/7] mm: pack compound_dtor and compound_order into one word in struct page
  2015-09-03 12:35   ` Kirill A. Shutemov
@ 2015-09-10  9:41     ` Vlastimil Babka
  -1 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2015-09-10  9:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, Aneesh Kumar K.V, linux-kernel, linux-mm

On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote:
> The patch halves space occupied by compound_dtor and compound_order in
> struct page.
> 
> For compound_order, it's trivial long -> short conversion.
> 
> For get_compound_page_dtor(), we now use hardcoded table for destructor
> lookup and store its index in the struct page instead of direct pointer
> to destructor. It shouldn't be a big trouble to maintain the table: we
> have only two destructor and NULL currently.
> 
> This patch free up one word in tail pages for reuse. This is preparation
> for the next patch.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
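
For completeness, the "hardcoded table" described in the changelog comes down
to roughly the following. This is only a sketch based on that description, so
treat the identifier names as assumptions and go by the patch itself for the
real definitions:

typedef void compound_page_dtor(struct page *);

/* Index into the destructor table, stored in the first tail page. */
enum compound_dtor_id {
	NULL_COMPOUND_DTOR,
	COMPOUND_PAGE_DTOR,
	HUGETLB_PAGE_DTOR,
	NR_COMPOUND_DTORS,
};

extern compound_page_dtor * const compound_page_dtors[];

static inline void set_compound_page_dtor(struct page *page,
		enum compound_dtor_id compound_dtor)
{
	VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
	page[1].compound_dtor = compound_dtor;
}

static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
{
	VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
	return compound_page_dtors[page[1].compound_dtor];
}

Storing a small index instead of a function pointer is what makes it possible
to shrink compound_dtor to a short here, and to a plain int on 64BIT in patch
7/7.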


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 5/7] mm: make compound_head() robust
  2015-09-03 12:35   ` Kirill A. Shutemov
@ 2015-09-10 10:54     ` Vlastimil Babka
  -1 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2015-09-10 10:54 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, Aneesh Kumar K.V, linux-kernel, linux-mm,
	Paul E. McKenney

On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote:
> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
> 	CPU0					CPU1
> 
> isolate_migratepages_block()
>   page_count()
>     compound_head()
>       !!PageTail() == true
> 					put_page()
> 					  tail->first_page = NULL
>       head = tail->first_page
> 					alloc_pages(__GFP_COMP)
> 					   prep_compound_page()
> 					     tail->first_page = head
> 					     __SetPageTail(p);
>       !!PageTail() == true
>     <head == NULL dereferencing>
> 
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
> 
> We can fix the race by changing how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
> 
> The patch introduces page->compound_head into third double word block in
> front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
> the rest bits are pointer to head page if bit zero is set.
> 
> The patch moves page->pmd_huge_pte out of word, just in case if an
> architecture defines pgtable_t into something what can have the bit 0
> set.
> 
> hugetlb_cgroup uses page->lru.next in the second tail page to store
> pointer struct hugetlb_cgroup. The patch switch it to use page->private
> in the second tail page instead. The space is free since ->first_page is
> removed from the union.
> 
> The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
> limitation, since there's now space in first tail page to store struct
> hugetlb_cgroup pointer. But that's out of scope of the patch.
> 
> That means page->compound_head shares storage space with:
> 
>  - page->lru.next;
>  - page->next;
>  - page->rcu_head.next;
> 
> That's too long list to be absolutely sure, but looks like nobody uses
> bit 0 of the word.

Given the discussion about rcu_head, that should warrant some summary here :)
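
For readers who do not want to dig the mechanism out of the hunks below: per
the description above, the new encoding amounts to something like this. The
helper names match the ones used later in the patch, but the bodies are a
sketch reconstructed from the changelog, not quoted from it:

static inline int PageTail(struct page *page)
{
	return READ_ONCE(page->compound_head) & 1;
}

static inline void set_compound_head(struct page *page, struct page *head)
{
	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
}

static inline void clear_compound_head(struct page *page)
{
	WRITE_ONCE(page->compound_head, 0);
}

static inline struct page *compound_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	/* Bit 0 set: the rest of the word is the head page pointer. */
	if (unlikely(head & 1))
		return (struct page *)(head - 1);
	return page;
}

Because PageTail() and the head pointer now live in one word, read and
written with a single READ_ONCE()/WRITE_ONCE(), the window in the race
diagram above goes away.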

> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -120,7 +120,13 @@ struct page {
>  		};
>  	};
>  
> -	/* Third double word block */
> +	/*
> +	 * Third double word block
> +	 *
> +	 * WARNING: bit 0 of the first word encode PageTail(). That means
> +	 * the rest users of the storage space MUST NOT use the bit to
> +	 * avoid collision and false-positive PageTail().
> +	 */
>  	union {
>  		struct list_head lru;	/* Pageout list, eg. active_list
>  					 * protected by zone->lru_lock !
> @@ -143,12 +149,19 @@ struct page {
>  						 */
>  		/* First tail page of compound page */

"First tail" doesn't apply for compound_head.

> index 097c7a4bfbd9..330377f83ac7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
>  				      (1L << PG_unevictable)));
>  		page_tail->flags |= (1L << PG_dirty);
>  
> -		/* clear PageTail before overwriting first_page */
> -		smp_wmb();
> +		clear_compound_head(page_tail);

I would sleep better if this were done before setting all the page->flags above.
Previously, PageTail was cleared by the first operation, which is
"page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;".
I do realize that it doesn't use WRITE_ONCE, so it might have been theoretically
broken already, if it does matter.
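
Roughly what I mean, purely as an illustration (the copying of the inherited
flag bits is elided; clear_compound_head(), PAGE_FLAGS_CHECK_AT_PREP and
PG_dirty are the lines already visible in the hunk):

		/* Stop the page looking like a tail page first ... */
		clear_compound_head(page_tail);
		/* ... and only then rewrite its flags word. */
		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
		/* ... copy the flag bits inherited from the head page ... */
		page_tail->flags |= (1L << PG_dirty);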

> diff --git a/mm/internal.h b/mm/internal.h
> index 36b23f1e2ca6..89e21a07080a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
>  	 * speculative page access (like in
>  	 * page_cache_get_speculative()) on tail pages.
>  	 */
> -	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
> +	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
>  	if (get_page_head)
> -		atomic_inc(&page->first_page->_count);
> +		atomic_inc(&compound_head(page)->_count);

Doing another compound_head() seems like overkill when this code already assumes
PageTail. All callers do it after if (PageTail()) which means they already did
READ_ONCE(page->compound_head) and here they do another one. Potentially with
different result in bit 0, which would be a subtle bug, that could be
interesting to catch with some VM_BUG_ON. I don't know if a direct plain
page->compound_head access here instead of compound_head() would also result in
a re-read, since the previous access did use READ_ONCE(). Maybe it would be best
to reorganize the code here and in the 3 call sites so that the READ_ONCE() used
to determine PageTail also obtains the compound head pointer.

Some of that is probably made moot by your other series, but better let's think
of this series as standalone first.
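
To make that concrete, a hypothetical helper (the name and placement are
invented here, nothing like it is in the patch) could hand back the head
pointer from the very read that answers the PageTail question:

/*
 * One READ_ONCE() both tests PageTail() and yields the head page, so bit 0
 * cannot change between the test and the dereference.
 */
static inline struct page *page_tail_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	return (head & 1) ? (struct page *)(head - 1) : NULL;
}

Callers would then write "head = page_tail_head(page); if (head) ..." instead
of the PageTail()/compound_head() pair.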

>  	get_huge_page_tail(page);
>  }
>  
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1f4446a90cef..4d1a5de9653d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
>  #define lru		(1UL << PG_lru)
>  #define swapbacked	(1UL << PG_swapbacked)
>  #define head		(1UL << PG_head)
> -#define tail		(1UL << PG_tail)
> -#define compound	(1UL << PG_compound)
>  #define slab		(1UL << PG_slab)
>  #define reserved	(1UL << PG_reserved)
>  
> @@ -811,12 +809,7 @@ static struct page_state {
>  	 */
>  	{ slab,		slab,		MF_MSG_SLAB,	me_kernel },
>  
> -#ifdef CONFIG_PAGEFLAGS_EXTENDED
>  	{ head,		head,		MF_MSG_HUGE,		me_huge_page },
> -	{ tail,		tail,		MF_MSG_HUGE,		me_huge_page },
> -#else
> -	{ compound,	compound,	MF_MSG_HUGE,		me_huge_page },
> -#endif
>  
>  	{ sc|dirty,	sc|dirty,	MF_MSG_DIRTY_SWAPCACHE,	me_swapcache_dirty },
>  	{ sc|dirty,	sc,		MF_MSG_CLEAN_SWAPCACHE,	me_swapcache_clean },
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c6733cc3cbce..a56ad53ff553 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -424,15 +424,15 @@ out:
>  /*
>   * Higher-order pages are called "compound pages".  They are structured thusly:
>   *
> - * The first PAGE_SIZE page is called the "head page".
> + * The first PAGE_SIZE page is called the "head page" and have PG_head set.
>   *
> - * The remaining PAGE_SIZE pages are called "tail pages".
> + * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
> + * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
>   *
> - * All pages have PG_compound set.  All tail pages have their ->first_page
> - * pointing at the head page.
> + * The first tail page's ->compound_dtor holds the offset in array of compound
> + * page destructors. See compound_page_dtors.
>   *
> - * The first tail page's ->lru.next holds the address of the compound page's
> - * put_page() function.  Its ->lru.prev holds the order of allocation.
> + * The first tail page's ->compound_order holds the order of allocation.
>   * This usage means that zero-order pages may not be compound.
>   */
>  
> @@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
>  	for (i = 1; i < nr_pages; i++) {
>  		struct page *p = page + i;
>  		set_page_count(p, 0);
> -		p->first_page = page;
> -		/* Make sure p->first_page is always valid for PageTail() */
> -		smp_wmb();
> -		__SetPageTail(p);
> +		set_compound_head(p, page);
>  	}
>  }
>  
> @@ -830,17 +827,30 @@ static void free_one_page(struct zone *zone,
>  
>  static int free_tail_pages_check(struct page *head_page, struct page *page)
>  {
> -	if (!IS_ENABLED(CONFIG_DEBUG_VM))
> -		return 0;
> +	int ret = 1;
> +
> +	/*
> +	 * We rely page->lru.next never has bit 0 set, unless the page
> +	 * is PageTail(). Let's make sure that's true even for poisoned ->lru.
> +	 */
> +	BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
> +
> +	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
> +		ret = 0;
> +		goto out;
> +	}
>  	if (unlikely(!PageTail(page))) {
>  		bad_page(page, "PageTail not set", 0);
> -		return 1;
> +		goto out;
>  	}
> -	if (unlikely(page->first_page != head_page)) {
> -		bad_page(page, "first_page not consistent", 0);
> -		return 1;
> +	if (unlikely(compound_head(page) != head_page)) {
> +		bad_page(page, "compound_head not consistent", 0);
> +		goto out;

Same here, although for a DEBUG_VM config only it's not as important.

>  	}
> -	return 0;
> +	ret = 0;
> +out:
> +	clear_compound_head(page);
> +	return ret;
>  }
>  
>  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> @@ -888,6 +898,8 @@ static void init_reserved_page(unsigned long pfn)
>  #else
>  static inline void init_reserved_page(unsigned long pfn)
>  {
> +	/* Avoid false-positive PageTail() */
> +	INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
>  }
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
> diff --git a/mm/swap.c b/mm/swap.c
> index a3a0a2f1f7c3..faa9e1687dea 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -200,7 +200,7 @@ out_put_single:
>  				__put_single_page(page);
>  			return;
>  		}
> -		VM_BUG_ON_PAGE(page_head != page->first_page, page);
> +		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
>  		/*
>  		 * We can release the refcount taken by
>  		 * get_page_unless_zero() now that
> @@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
>  	 *  Case 3 is possible, as we may race with
>  	 *  __split_huge_page_refcount tearing down a THP page.
>  	 */
> -	page_head = compound_head_by_tail(page);
> +	page_head = compound_head(page);

This is also in a path after PageTail() is true.

>  	if (!__compound_tail_refcounted(page_head))
>  		put_unrefcounted_compound_page(page_head, page);
>  	else
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 6/7] mm: use 'unsigned int' for page order
  2015-09-03 12:35   ` Kirill A. Shutemov
@ 2015-09-10 11:04     ` Vlastimil Babka
  -1 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2015-09-10 11:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, Aneesh Kumar K.V, linux-kernel, linux-mm

On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote:
> Let's try to be consistent about data type of page order.

Long overdue I'd say :) Thanks.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 7/7] mm: use 'unsigned int' for compound_dtor/compound_order on 64BIT
  2015-09-03 12:35   ` Kirill A. Shutemov
@ 2015-09-10 11:34     ` Vlastimil Babka
  -1 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2015-09-10 11:34 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins
  Cc: Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, Aneesh Kumar K.V, linux-kernel, linux-mm

On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote:
> On 64 bit system we have enough space in struct page to encode
> compound_dtor and compound_order with unsigned int.
> 
> On x86-64 it leads to slightly smaller code size due usesage of plain
> MOV instead of MOVZX (zero-extended move) or similar effect.
> 
> allyesconfig:
> 
>    text	   data	    bss	    dec	    hex	filename
> 159520446	48146736	72196096	279863278	10ae5fee	vmlinux.pre
> 159520382	48146736	72196096	279863214	10ae5fae	vmlinux.post
> 
> On other architectures without native support of 16-bit data types the
> difference can be bigger.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mm_types.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index ecaf3b1d0216..39b0db74ba5e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -150,8 +150,13 @@ struct page {
>  		/* First tail page of compound page */
>  		struct {
>  			unsigned long compound_head; /* If bit zero is set */

I'm indifferent to this change. But a comment here explaining why would save
readers a trip through git blame when they wonder why it's done (see the
sketch below the quoted hunk).

> +#ifdef CONFIG_64BIT
> +			unsigned int compound_dtor;
> +			unsigned int compound_order;
> +#else
>  			unsigned short int compound_dtor;
>  			unsigned short int compound_order;
> +#endif
>  		};
>  
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
> 
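
Something along these lines is all I am asking for; the comment wording is
only an example, of course:

		/* First tail page of compound page */
		struct {
			unsigned long compound_head; /* If bit zero is set */
#ifdef CONFIG_64BIT
			/*
			 * There is room for 32-bit fields here on 64BIT and
			 * they generate slightly better code than 16-bit
			 * ones (plain MOV instead of MOVZX on x86-64), so
			 * use 'unsigned int'.
			 */
			unsigned int compound_dtor;
			unsigned int compound_order;
#else
			unsigned short int compound_dtor;
			unsigned short int compound_order;
#endif
		};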


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 5/7] mm: make compound_head() robust
  2015-09-10 10:54     ` Vlastimil Babka
@ 2015-09-11 13:35       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 34+ messages in thread
From: Kirill A. Shutemov @ 2015-09-11 13:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, Aneesh Kumar K.V, linux-kernel, linux-mm,
	Paul E. McKenney

On Thu, Sep 10, 2015 at 12:54:08PM +0200, Vlastimil Babka wrote:
> On 09/03/2015 02:35 PM, Kirill A. Shutemov wrote:
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> > 	CPU0					CPU1
> > 
> > isolate_migratepages_block()
> >   page_count()
> >     compound_head()
> >       !!PageTail() == true
> > 					put_page()
> > 					  tail->first_page = NULL
> >       head = tail->first_page
> > 					alloc_pages(__GFP_COMP)
> > 					   prep_compound_page()
> > 					     tail->first_page = head
> > 					     __SetPageTail(p);
> >       !!PageTail() == true
> >     <head == NULL dereferencing>
> > 
> > The race is pure theoretical. I don't it's possible to trigger it in
> > practice. But who knows.
> > 
> > We can fix the race by changing how encode PageTail() and compound_head()
> > within struct page to be able to update them in one shot.
> > 
> > The patch introduces page->compound_head into third double word block in
> > front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
> > the rest bits are pointer to head page if bit zero is set.
> > 
> > The patch moves page->pmd_huge_pte out of word, just in case if an
> > architecture defines pgtable_t into something what can have the bit 0
> > set.
> > 
> > hugetlb_cgroup uses page->lru.next in the second tail page to store
> > pointer struct hugetlb_cgroup. The patch switch it to use page->private
> > in the second tail page instead. The space is free since ->first_page is
> > removed from the union.
> > 
> > The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
> > limitation, since there's now space in first tail page to store struct
> > hugetlb_cgroup pointer. But that's out of scope of the patch.
> > 
> > That means page->compound_head shares storage space with:
> > 
> >  - page->lru.next;
> >  - page->next;
> >  - page->rcu_head.next;
> > 
> > That's too long list to be absolutely sure, but looks like nobody uses
> > bit 0 of the word.
> 
> Given the discussion about rcu_head, that should warrant some summary here :)

Agreed.

> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -120,7 +120,13 @@ struct page {
> >  		};
> >  	};
> >  
> > -	/* Third double word block */
> > +	/*
> > +	 * Third double word block
> > +	 *
> > +	 * WARNING: bit 0 of the first word encode PageTail(). That means
> > +	 * the rest users of the storage space MUST NOT use the bit to
> > +	 * avoid collision and false-positive PageTail().
> > +	 */
> >  	union {
> >  		struct list_head lru;	/* Pageout list, eg. active_list
> >  					 * protected by zone->lru_lock !
> > @@ -143,12 +149,19 @@ struct page {
> >  						 */
> >  		/* First tail page of compound page */
> 
> "First tail" doesn't apply for compound_head.

I'll adjust comments.
 
> > index 097c7a4bfbd9..330377f83ac7 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
> >  				      (1L << PG_unevictable)));
> >  		page_tail->flags |= (1L << PG_dirty);
> >  
> > -		/* clear PageTail before overwriting first_page */
> > -		smp_wmb();
> > +		clear_compound_head(page_tail);
> 
> I would sleep better if this was done before setting all the page->flags above,
> previously, PageTail was cleared by the first operation which is
> "page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;"
> I do realize that it doesn't use WRITE_ONCE, so it might have been theoretically
> broken already, if it does matter.

Right. Nothing enforces a particular order. If we really need some ordering
of PageTail() vs. page->flags, let's define it, but so far I don't see a
reason to change this part.

> > diff --git a/mm/internal.h b/mm/internal.h
> > index 36b23f1e2ca6..89e21a07080a 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
> >  	 * speculative page access (like in
> >  	 * page_cache_get_speculative()) on tail pages.
> >  	 */
> > -	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
> > +	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
> >  	if (get_page_head)
> > -		atomic_inc(&page->first_page->_count);
> > +		atomic_inc(&compound_head(page)->_count);
> 
> Doing another compound_head() seems like overkill when this code already assumes
> PageTail.

"Overkill"? It's too strong wording for re-read hot cache line.

> All callers do it after if (PageTail()) which means they already did
> READ_ONCE(page->compound_head) and here they do another one. Potentially with
> different result in bit 0, which would be a subtle bug, that could be
> interesting to catch with some VM_BUG_ON. I don't know if a direct plain
> page->compound_head access here instead of compound_head() would also result in
> a re-read, since the previous access did use READ_ONCE(). Maybe it would be best
> to reorganize the code here and in the 3 call sites so that the READ_ONCE() used
> to determine PageTail also obtains the compound head pointer.

All we would possibly win by the change is a few bytes of code. An additional
READ_ONCE() only affects code generation. It doesn't introduce any CPU
barriers. The cache line with compound_head is in L1 anyway.

I don't see justification to change this part either. If you think we can
gain something by reworking this code, feel free to propose a patch on top.

> Some of that is probably made moot by your other series, but better let's think
> of this series as standalone first.
> 
> >  	get_huge_page_tail(page);
> >  }
> >  
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 1f4446a90cef..4d1a5de9653d 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -787,8 +787,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
> >  #define lru		(1UL << PG_lru)
> >  #define swapbacked	(1UL << PG_swapbacked)
> >  #define head		(1UL << PG_head)
> > -#define tail		(1UL << PG_tail)
> > -#define compound	(1UL << PG_compound)
> >  #define slab		(1UL << PG_slab)
> >  #define reserved	(1UL << PG_reserved)
> >  
> > @@ -811,12 +809,7 @@ static struct page_state {
> >  	 */
> >  	{ slab,		slab,		MF_MSG_SLAB,	me_kernel },
> >  
> > -#ifdef CONFIG_PAGEFLAGS_EXTENDED
> >  	{ head,		head,		MF_MSG_HUGE,		me_huge_page },
> > -	{ tail,		tail,		MF_MSG_HUGE,		me_huge_page },
> > -#else
> > -	{ compound,	compound,	MF_MSG_HUGE,		me_huge_page },
> > -#endif
> >  
> >  	{ sc|dirty,	sc|dirty,	MF_MSG_DIRTY_SWAPCACHE,	me_swapcache_dirty },
> >  	{ sc|dirty,	sc,		MF_MSG_CLEAN_SWAPCACHE,	me_swapcache_clean },
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c6733cc3cbce..a56ad53ff553 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -424,15 +424,15 @@ out:
> >  /*
> >   * Higher-order pages are called "compound pages".  They are structured thusly:
> >   *
> > - * The first PAGE_SIZE page is called the "head page".
> > + * The first PAGE_SIZE page is called the "head page" and have PG_head set.
> >   *
> > - * The remaining PAGE_SIZE pages are called "tail pages".
> > + * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
> > + * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
> >   *
> > - * All pages have PG_compound set.  All tail pages have their ->first_page
> > - * pointing at the head page.
> > + * The first tail page's ->compound_dtor holds the offset in array of compound
> > + * page destructors. See compound_page_dtors.
> >   *
> > - * The first tail page's ->lru.next holds the address of the compound page's
> > - * put_page() function.  Its ->lru.prev holds the order of allocation.
> > + * The first tail page's ->compound_order holds the order of allocation.
> >   * This usage means that zero-order pages may not be compound.
> >   */
> >  
> > @@ -452,10 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
> >  	for (i = 1; i < nr_pages; i++) {
> >  		struct page *p = page + i;
> >  		set_page_count(p, 0);
> > -		p->first_page = page;
> > -		/* Make sure p->first_page is always valid for PageTail() */
> > -		smp_wmb();
> > -		__SetPageTail(p);
> > +		set_compound_head(p, page);
> >  	}
> >  }
> >  
> > @@ -830,17 +827,30 @@ static void free_one_page(struct zone *zone,
> >  
> >  static int free_tail_pages_check(struct page *head_page, struct page *page)
> >  {
> > -	if (!IS_ENABLED(CONFIG_DEBUG_VM))
> > -		return 0;
> > +	int ret = 1;
> > +
> > +	/*
> > +	 * We rely page->lru.next never has bit 0 set, unless the page
> > +	 * is PageTail(). Let's make sure that's true even for poisoned ->lru.
> > +	 */
> > +	BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
> > +
> > +	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
> > +		ret = 0;
> > +		goto out;
> > +	}
> >  	if (unlikely(!PageTail(page))) {
> >  		bad_page(page, "PageTail not set", 0);
> > -		return 1;
> > +		goto out;
> >  	}
> > -	if (unlikely(page->first_page != head_page)) {
> > -		bad_page(page, "first_page not consistent", 0);
> > -		return 1;
> > +	if (unlikely(compound_head(page) != head_page)) {
> > +		bad_page(page, "compound_head not consistent", 0);
> > +		goto out;
> 
> Same here, although for a DEBUG_VM config only it's not as important.

Ditto.

> 
> >  	}
> > -	return 0;
> > +	ret = 0;
> > +out:
> > +	clear_compound_head(page);
> > +	return ret;
> >  }
> >  
> >  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > @@ -888,6 +898,8 @@ static void init_reserved_page(unsigned long pfn)
> >  #else
> >  static inline void init_reserved_page(unsigned long pfn)
> >  {
> > +	/* Avoid false-positive PageTail() */
> > +	INIT_LIST_HEAD(&pfn_to_page(pfn)->lru);
> >  }
> >  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> >  
> > diff --git a/mm/swap.c b/mm/swap.c
> > index a3a0a2f1f7c3..faa9e1687dea 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -200,7 +200,7 @@ out_put_single:
> >  				__put_single_page(page);
> >  			return;
> >  		}
> > -		VM_BUG_ON_PAGE(page_head != page->first_page, page);
> > +		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
> >  		/*
> >  		 * We can release the refcount taken by
> >  		 * get_page_unless_zero() now that
> > @@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
> >  	 *  Case 3 is possible, as we may race with
> >  	 *  __split_huge_page_refcount tearing down a THP page.
> >  	 */
> > -	page_head = compound_head_by_tail(page);
> > +	page_head = compound_head(page);
> 
> This is also in a path after PageTail() is true.

We can only save one branch here: the other PageTail() is most likely in
another compilation unit and the compiler would not be able to eliminate the
additional load.
Why bother?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHv5 5/7] mm: make compound_head() robust
  2015-09-11 13:35       ` Kirill A. Shutemov
@ 2015-09-14 13:31         ` Vlastimil Babka
  -1 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2015-09-14 13:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Dave Hansen, Johannes Weiner, Michal Hocko,
	David Rientjes, Aneesh Kumar K.V, linux-kernel, linux-mm,
	Paul E. McKenney

On 09/11/2015 03:35 PM, Kirill A. Shutemov wrote:
>>> index 097c7a4bfbd9..330377f83ac7 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1686,8 +1686,7 @@ static void __split_huge_page_refcount(struct page *page,
>>>   				      (1L << PG_unevictable)));
>>>   		page_tail->flags |= (1L << PG_dirty);
>>>
>>> -		/* clear PageTail before overwriting first_page */
>>> -		smp_wmb();
>>> +		clear_compound_head(page_tail);
>>
>> I would sleep better if this was done before setting all the page->flags above,
>> previously, PageTail was cleared by the first operation which is
>> "page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;"
>> I do realize that it doesn't use WRITE_ONCE, so it might have been theoretically
>> broken already, if it does matter.
>
> Right. Nothing enforces a particular order. If we really need some
> ordering of PageTail() vs. page->flags, let's define it, but so far I
> don't see a reason to change this part.

OK.
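
(For context, clear_compound_head() in this series is essentially a
single store of 0 to page->compound_head; roughly, and assuming the
bit-0 encoding sketched earlier:

	static inline void clear_compound_head(struct page *page)
	{
		WRITE_ONCE(page->compound_head, 0);
	}

so the ordering question is only about where that store sits relative
to the page->flags updates.)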

>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 36b23f1e2ca6..89e21a07080a 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -61,9 +61,9 @@ static inline void __get_page_tail_foll(struct page *page,
>>>   	 * speculative page access (like in
>>>   	 * page_cache_get_speculative()) on tail pages.
>>>   	 */
>>> -	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
>>> +	VM_BUG_ON_PAGE(atomic_read(&compound_head(page)->_count) <= 0, page);
>>>   	if (get_page_head)
>>> -		atomic_inc(&page->first_page->_count);
>>> +		atomic_inc(&compound_head(page)->_count);
>>
>> Doing another compound_head() seems like overkill when this code already assumes
>> PageTail.
>
> "Overkill"? It's too strong wording for re-read hot cache line.

Hm good point.

>> All callers do it after if (PageTail()), which means they already did
>> READ_ONCE(page->compound_head), and here they do another one, potentially
>> with a different result in bit 0. That would be a subtle bug that could be
>> interesting to catch with some VM_BUG_ON. I don't know if a direct plain
>> page->compound_head access here instead of compound_head() would also
>> result in a re-read, since the previous access did use READ_ONCE(). Maybe
>> it would be best to reorganize the code here and in the 3 call sites so
>> that the READ_ONCE() used to determine PageTail also obtains the compound
>> head pointer.
>
> All we would possibly win by the change is a few bytes of code. The
> additional READ_ONCE() only affects code generation; it doesn't introduce
> any CPU barriers. The cache line with compound_head is in L1 anyway.
>
> I don't see justification to change this part either. If you think we can
> gain something by reworking this code, feel free to propose a patch on top.

OK, it's probably not worth it:

add/remove: 0/0 grow/shrink: 1/3 up/down: 7/-66 (-59)
function                                     old     new   delta
follow_trans_huge_pmd                        516     523      +7
follow_page_pte                              905     893     -12
follow_hugetlb_page                          943     919     -24
__get_page_tail                              440     410     -30
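
(The reorganization discussed above could look roughly like the helper
below; the name is made up for illustration and this is not part of the
series:

	/*
	 * Hypothetical: a single READ_ONCE() serves both the PageTail
	 * test and the head lookup, so bit 0 cannot change between the
	 * two reads.
	 */
	static inline struct page *tail_to_head(struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		return (head & 1) ? (struct page *)(head - 1) : NULL;
	}

with callers doing "head = tail_to_head(page); if (head) ..." instead of
a separate PageTail() check followed by compound_head().)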


>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index a3a0a2f1f7c3..faa9e1687dea 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -200,7 +200,7 @@ out_put_single:
>>>   				__put_single_page(page);
>>>   			return;
>>>   		}
>>> -		VM_BUG_ON_PAGE(page_head != page->first_page, page);
>>> +		VM_BUG_ON_PAGE(page_head != compound_head(page), page);
>>>   		/*
>>>   		 * We can release the refcount taken by
>>>   		 * get_page_unless_zero() now that
>>> @@ -261,7 +261,7 @@ static void put_compound_page(struct page *page)
>>>   	 *  Case 3 is possible, as we may race with
>>>   	 *  __split_huge_page_refcount tearing down a THP page.
>>>   	 */
>>> -	page_head = compound_head_by_tail(page);
>>> +	page_head = compound_head(page);
>>
>> This is also in a path after PageTail() is true.
>
> We can only save one branch here: the other PageTail() is most likely in
> another compilation unit, so the compiler would not be able to eliminate
> the additional load.
> Why bother?

Hmm, right. Bah.

add/remove: 0/0 grow/shrink: 1/0 up/down: 8/0 (8)
function                                     old     new   delta
put_compound_page                            500     508      +8




^ permalink raw reply	[flat|nested] 34+ messages in thread


end of thread, other threads:[~2015-09-14 13:31 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-03 12:35 [PATCHv5 0/7] Fix compound_head() race Kirill A. Shutemov
2015-09-03 12:35 ` Kirill A. Shutemov
2015-09-03 12:35 ` [PATCHv5 1/7] mm: drop page->slab_page Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-07  5:00   ` Alexander Duyck
2015-09-07  5:00     ` Alexander Duyck
2015-09-07 12:19     ` Kirill A. Shutemov
2015-09-07 12:19       ` Kirill A. Shutemov
2015-09-03 12:35 ` [PATCHv5 2/7] slub: use page->rcu_head instead of page->lru plus cast Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-03 12:35 ` [PATCHv5 3/7] zsmalloc: use page->private instead of page->first_page Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-03 12:35 ` [PATCHv5 4/7] mm: pack compound_dtor and compound_order into one word in struct page Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-10  9:41   ` Vlastimil Babka
2015-09-10  9:41     ` Vlastimil Babka
2015-09-03 12:35 ` [PATCHv5 5/7] mm: make compound_head() robust Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-10 10:54   ` Vlastimil Babka
2015-09-10 10:54     ` Vlastimil Babka
2015-09-11 13:35     ` Kirill A. Shutemov
2015-09-11 13:35       ` Kirill A. Shutemov
2015-09-14 13:31       ` Vlastimil Babka
2015-09-14 13:31         ` Vlastimil Babka
2015-09-03 12:35 ` [PATCHv5 6/7] mm: use 'unsigned int' for page order Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-10 11:04   ` Vlastimil Babka
2015-09-10 11:04     ` Vlastimil Babka
2015-09-03 12:35 ` [PATCHv5 7/7] mm: use 'unsigned int' for compound_dtor/compound_order on 64BIT Kirill A. Shutemov
2015-09-03 12:35   ` Kirill A. Shutemov
2015-09-10 11:34   ` Vlastimil Babka
2015-09-10 11:34     ` Vlastimil Babka
2015-09-04 13:43 ` [PATCHv5 0/7] Fix compound_head() race Andrea Arcangeli
2015-09-04 13:43   ` Andrea Arcangeli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.