LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCH] mm/huge_memory.c: split should clone page flags before unfreezing pageref
@ 2018-02-11 10:35 Konstantin Khlebnikov
  2018-02-11 11:07 ` Kirill A. Shutemov
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-11 10:35 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, linux-kernel
  Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

THP split makes non-atomic change of tail page flags. This is almost ok
because tail pages are locked and isolated but this breaks recent changes
in page locking: non-atomic operation could clear bit PG_waiters.

As a result concurrent sequence get_page_unless_zero() -> lock_page()
might block forever. Especially if this page was truncated later.

Fix is trivial: clone flags before unfreezing page reference counter.

This race exists since commit 62906027091f ("mm: add PageWaiters indicating
tasks are waiting for a page bit") while unsave unfreeze itself was added
in commit 8df651c7059e ("thp: cleanup split_huge_page()").

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/huge_memory.c |   25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87ab9b8f56b5..2b38d9f2f262 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,6 +2357,19 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
 	VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
 
+	/* Clone page flags before unfreezing refcount. */
+	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	page_tail->flags |= (head->flags &
+			((1L << PG_referenced) |
+			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
+			 (1L << PG_mlocked) |
+			 (1L << PG_uptodate) |
+			 (1L << PG_active) |
+			 (1L << PG_locked) |
+			 (1L << PG_unevictable) |
+			 (1L << PG_dirty)));
+
 	/*
 	 * tail_page->_refcount is zero and not changing from under us. But
 	 * get_page_unless_zero() may be running from under us on the
@@ -2375,18 +2388,6 @@ static void __split_huge_page_tail(struct page *head, int tail,
 		page_ref_add(page_tail, 2);
 	}
 
-	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-	page_tail->flags |= (head->flags &
-			((1L << PG_referenced) |
-			 (1L << PG_swapbacked) |
-			 (1L << PG_swapcache) |
-			 (1L << PG_mlocked) |
-			 (1L << PG_uptodate) |
-			 (1L << PG_active) |
-			 (1L << PG_locked) |
-			 (1L << PG_unevictable) |
-			 (1L << PG_dirty)));
-
 	/*
 	 * After clearing PageTail the gup refcount can be released.
 	 * Page flags also must be visible before we make the page non-compound.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm/huge_memory.c: split should clone page flags before unfreezing pageref
  2018-02-11 10:35 [PATCH] mm/huge_memory.c: split should clone page flags before unfreezing pageref Konstantin Khlebnikov
@ 2018-02-11 11:07 ` Kirill A. Shutemov
  2018-02-11 13:13   ` Konstantin Khlebnikov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2018-02-11 11:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

On Sun, Feb 11, 2018 at 01:35:17PM +0300, Konstantin Khlebnikov wrote:
> THP split makes non-atomic change of tail page flags. This is almost ok
> because tail pages are locked and isolated but this breaks recent changes
> in page locking: non-atomic operation could clear bit PG_waiters.
> 
> As a result concurrent sequence get_page_unless_zero() -> lock_page()
> might block forever. Especially if this page was truncated later.
> 
> Fix is trivial: clone flags before unfreezing page reference counter.
> 
> This race exists since commit 62906027091f ("mm: add PageWaiters indicating
> tasks are waiting for a page bit") while unsave unfreeze itself was added
> in commit 8df651c7059e ("thp: cleanup split_huge_page()").

Hm. Don't we have to have barrier between setting flags and updating
the refcounter in this case? Atomics don't generally have this semantics,
so you can see new refcount before new flags even after the change.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm/huge_memory.c: split should clone page flags before unfreezing pageref
  2018-02-11 11:07 ` Kirill A. Shutemov
@ 2018-02-11 13:13   ` Konstantin Khlebnikov
  2018-02-11 14:29     ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-11 13:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

On 11.02.2018 14:07, Kirill A. Shutemov wrote:
> On Sun, Feb 11, 2018 at 01:35:17PM +0300, Konstantin Khlebnikov wrote:
>> THP split makes non-atomic change of tail page flags. This is almost ok
>> because tail pages are locked and isolated but this breaks recent changes
>> in page locking: non-atomic operation could clear bit PG_waiters.
>>
>> As a result concurrent sequence get_page_unless_zero() -> lock_page()
>> might block forever. Especially if this page was truncated later.
>>
>> Fix is trivial: clone flags before unfreezing page reference counter.
>>
>> This race exists since commit 62906027091f ("mm: add PageWaiters indicating
>> tasks are waiting for a page bit") while unsave unfreeze itself was added
>> in commit 8df651c7059e ("thp: cleanup split_huge_page()").
> 
> Hm. Don't we have to have barrier between setting flags and updating
> the refcounter in this case? Atomics don't generally have this semantics,
> so you can see new refcount before new flags even after the change.
> 

Ok.

I see another problem here - clear_compound_head() is placed after unfreeze.

This opens race window with get/put_page after speculative get page.
I think successful get_page_unless_zero() must stabilize compound_head() for tails as well as for heads.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 13:13   ` Konstantin Khlebnikov
@ 2018-02-11 14:29     ` Konstantin Khlebnikov
  2018-02-11 15:14       ` Kirill A. Shutemov
  2018-02-11 20:09       ` Matthew Wilcox
  2018-02-12 13:58     ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
  2018-02-12 13:58     ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
  2 siblings, 2 replies; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-11 14:29 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, linux-kernel
  Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

THP split makes non-atomic change of tail page flags. This is almost ok
because tail pages are locked and isolated but this breaks recent changes
in page locking: non-atomic operation could clear bit PG_waiters.

As a result concurrent sequence get_page_unless_zero() -> lock_page()
might block forever. Especially if this page was truncated later.

Fix is trivial: clone flags before unfreezing page reference counter.

This race exists since commit 62906027091f ("mm: add PageWaiters indicating
tasks are waiting for a page bit") while unsave unfreeze itself was added
in commit 8df651c7059e ("thp: cleanup split_huge_page()").

clear_compound_head() also must be called before unfreezing page reference
because successful get_page_unless_zero() must stabilize compound_head().

And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
is made especially for that and has semantic of smp_store_release().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/huge_memory.c |   34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87ab9b8f56b5..fa577aa7ecd8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2355,26 +2355,11 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	struct page *page_tail = head + tail;
 
 	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
-	VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
 
 	/*
-	 * tail_page->_refcount is zero and not changing from under us. But
-	 * get_page_unless_zero() may be running from under us on the
-	 * tail_page. If we used atomic_set() below instead of atomic_inc() or
-	 * atomic_add(), we would then run atomic_set() concurrently with
-	 * get_page_unless_zero(), and atomic_set() is implemented in C not
-	 * using locked ops. spin_unlock on x86 sometime uses locked ops
-	 * because of PPro errata 66, 92, so unless somebody can guarantee
-	 * atomic_set() here would be safe on all archs (and not only on x86),
-	 * it's safer to use atomic_inc()/atomic_add().
+	 * Clone page flags before unfreezing refcount.
+	 * lock_page() after speculative get wants to set PG_waiters.
 	 */
-	if (PageAnon(head) && !PageSwapCache(head)) {
-		page_ref_inc(page_tail);
-	} else {
-		/* Additional pin to radix tree */
-		page_ref_add(page_tail, 2);
-	}
-
 	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
@@ -2387,14 +2372,21 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
 
-	/*
-	 * After clearing PageTail the gup refcount can be released.
-	 * Page flags also must be visible before we make the page non-compound.
-	 */
+	/* Page flags must be visible before we make the page non-compound. */
 	smp_wmb();
 
+	/*
+	 * Clear PageTail before unfreezing page refcount:
+	 * speculative refcount must stabilize compound_head().
+	 */
 	clear_compound_head(page_tail);
 
+	/*
+	 * Finally unfreeze refcount. Additional pin to radix tree.
+	 */
+	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+					  PageSwapCache(head)));
+
 	if (page_is_young(head))
 		set_page_young(page_tail);
 	if (page_is_idle(head))

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 14:29     ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
@ 2018-02-11 15:14       ` Kirill A. Shutemov
  2018-02-11 15:32         ` Konstantin Khlebnikov
  2018-02-11 20:09       ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2018-02-11 15:14 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
> is made especially for that and has semantic of smp_store_release().

Nak on this part.

page_ref_unfreeze() uses atomic_set() which neglects the situation in the
comment you're removing.

You need at least explain why it's safe now.

I would rather leave page_ref_inc()/page_ref_add() + explcit
smp_mb__before_atomic().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 15:14       ` Kirill A. Shutemov
@ 2018-02-11 15:32         ` Konstantin Khlebnikov
  2018-02-11 15:47           ` Kirill A. Shutemov
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-11 15:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Konstantin Khlebnikov, linux-mm, Andrew Morton,
	Linux Kernel Mailing List, Michal Hocko, Linus Torvalds,
	Kirill A. Shutemov, Nicholas Piggin

On Sun, Feb 11, 2018 at 6:14 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
>> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
>> is made especially for that and has semantic of smp_store_release().
>
> Nak on this part.
>
> page_ref_unfreeze() uses atomic_set() which neglects the situation in the
> comment you're removing.

Why? look into x86 smp_store_release
for PPro it use same sequence smp_wb + WRITE_ONCE
As I see spin_unlock uses exactly this macro.

Anyway if page_ref_unfreeze cannot handle races with
get_page_unless_zero() then it completely useless,

>
> You need at least explain why it's safe now.
>
> I would rather leave page_ref_inc()/page_ref_add() + explcit
> smp_mb__before_atomic().
>
> --
>  Kirill A. Shutemov
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 15:32         ` Konstantin Khlebnikov
@ 2018-02-11 15:47           ` Kirill A. Shutemov
  2018-02-11 15:55             ` Konstantin Khlebnikov
  0 siblings, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2018-02-11 15:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Konstantin Khlebnikov, linux-mm, Andrew Morton,
	Linux Kernel Mailing List, Michal Hocko, Linus Torvalds,
	Kirill A. Shutemov, Nicholas Piggin

On Sun, Feb 11, 2018 at 06:32:10PM +0300, Konstantin Khlebnikov wrote:
> On Sun, Feb 11, 2018 at 6:14 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
> >> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
> >> is made especially for that and has semantic of smp_store_release().
> >
> > Nak on this part.
> >
> > page_ref_unfreeze() uses atomic_set() which neglects the situation in the
> > comment you're removing.
> 
> Why? look into x86 smp_store_release
> for PPro it use same sequence smp_wb + WRITE_ONCE
> As I see spin_unlock uses exactly this macro.
> 
> Anyway if page_ref_unfreeze cannot handle races with
> get_page_unless_zero() then it completely useless,

Okay, fair enough.

But please call it out it the commit message.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 15:47           ` Kirill A. Shutemov
@ 2018-02-11 15:55             ` Konstantin Khlebnikov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-11 15:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Konstantin Khlebnikov, linux-mm, Andrew Morton,
	Linux Kernel Mailing List, Michal Hocko, Linus Torvalds,
	Kirill A. Shutemov, Nicholas Piggin

On Sun, Feb 11, 2018 at 6:47 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Feb 11, 2018 at 06:32:10PM +0300, Konstantin Khlebnikov wrote:
>> On Sun, Feb 11, 2018 at 6:14 PM, Kirill A. Shutemov
>> <kirill@shutemov.name> wrote:
>> > On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
>> >> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
>> >> is made especially for that and has semantic of smp_store_release().
>> >
>> > Nak on this part.
>> >
>> > page_ref_unfreeze() uses atomic_set() which neglects the situation in the
>> > comment you're removing.
>>
>> Why? look into x86 smp_store_release
>> for PPro it use same sequence smp_wb + WRITE_ONCE
>> As I see spin_unlock uses exactly this macro.
>>
>> Anyway if page_ref_unfreeze cannot handle races with
>> get_page_unless_zero() then it completely useless,
>
> Okay, fair enough.
>
> But please call it out it the commit message.

OK, I'll review this yet again and resend tomorrow.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 14:29     ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
  2018-02-11 15:14       ` Kirill A. Shutemov
@ 2018-02-11 20:09       ` Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2018-02-11 20:09 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
> +	/*
> +	 * Finally unfreeze refcount. Additional pin to radix tree.
> +	 */
> +	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
> +					  PageSwapCache(head)));

Please say "Additional pin from page cache", not "radix tree".

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze
  2018-02-11 13:13   ` Konstantin Khlebnikov
  2018-02-11 14:29     ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
@ 2018-02-12 13:58     ` Konstantin Khlebnikov
  2018-02-12 14:07       ` Kirill A. Shutemov
  2018-02-12 13:58     ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
  2 siblings, 1 reply; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-12 13:58 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, linux-kernel
  Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

page_ref_unfreeze() has exactly that semantic. No functional
changes: just minus one barrier and proper handling of PPro errata.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/page_ref.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 760d74a0e9a9..14d14beb1f7f 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -175,8 +175,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
 	VM_BUG_ON_PAGE(page_count(page) != 0, page);
 	VM_BUG_ON(count == 0);
 
-	smp_mb();
-	atomic_set(&page->_refcount, count);
+	atomic_set_release(&page->_refcount, count);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
 		__page_ref_unfreeze(page, count);
 }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-11 13:13   ` Konstantin Khlebnikov
  2018-02-11 14:29     ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
  2018-02-12 13:58     ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
@ 2018-02-12 13:58     ` Konstantin Khlebnikov
  2018-02-12 14:11       ` Kirill A. Shutemov
  2 siblings, 1 reply; 13+ messages in thread
From: Konstantin Khlebnikov @ 2018-02-12 13:58 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, linux-kernel
  Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

THP split makes non-atomic change of tail page flags. This is almost ok
because tail pages are locked and isolated but this breaks recent changes
in page locking: non-atomic operation could clear bit PG_waiters.

As a result concurrent sequence get_page_unless_zero() -> lock_page()
might block forever. Especially if this page was truncated later.

Fix is trivial: clone flags before unfreezing page reference counter.

This race exists since commit 62906027091f ("mm: add PageWaiters indicating
tasks are waiting for a page bit") while unsave unfreeze itself was added
in commit 8df651c7059e ("thp: cleanup split_huge_page()").

clear_compound_head() also must be called before unfreezing page reference
because after successful get_page_unless_zero() might follow put_page()
which needs correct compound_head().

And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
is made especially for that and has semantic of smp_store_release().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/huge_memory.c |   36 +++++++++++++++---------------------
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87ab9b8f56b5..7a8ead256b2f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2355,26 +2355,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	struct page *page_tail = head + tail;
 
 	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
-	VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
 
 	/*
-	 * tail_page->_refcount is zero and not changing from under us. But
-	 * get_page_unless_zero() may be running from under us on the
-	 * tail_page. If we used atomic_set() below instead of atomic_inc() or
-	 * atomic_add(), we would then run atomic_set() concurrently with
-	 * get_page_unless_zero(), and atomic_set() is implemented in C not
-	 * using locked ops. spin_unlock on x86 sometime uses locked ops
-	 * because of PPro errata 66, 92, so unless somebody can guarantee
-	 * atomic_set() here would be safe on all archs (and not only on x86),
-	 * it's safer to use atomic_inc()/atomic_add().
+	 * Clone page flags before unfreezing refcount.
+	 *
+	 * After successful get_page_unless_zero() might follow flags change,
+	 * for exmaple lock_page() which set PG_waiters.
 	 */
-	if (PageAnon(head) && !PageSwapCache(head)) {
-		page_ref_inc(page_tail);
-	} else {
-		/* Additional pin to radix tree */
-		page_ref_add(page_tail, 2);
-	}
-
 	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
@@ -2387,14 +2374,21 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
 
-	/*
-	 * After clearing PageTail the gup refcount can be released.
-	 * Page flags also must be visible before we make the page non-compound.
-	 */
+	/* Page flags must be visible before we make the page non-compound. */
 	smp_wmb();
 
+	/*
+	 * Clear PageTail before unfreezing page refcount.
+	 *
+	 * After successful get_page_unless_zero() might follow put_page()
+	 * which needs correct compound_head().
+	 */
 	clear_compound_head(page_tail);
 
+	/* Finally unfreeze refcount. Additional reference from page cache. */
+	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+					  PageSwapCache(head)));
+
 	if (page_is_young(head))
 		set_page_young(page_tail);
 	if (page_is_idle(head))

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze
  2018-02-12 13:58     ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
@ 2018-02-12 14:07       ` Kirill A. Shutemov
  0 siblings, 0 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2018-02-12 14:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

On Mon, Feb 12, 2018 at 04:58:50PM +0300, Konstantin Khlebnikov wrote:
> page_ref_unfreeze() has exactly that semantic. No functional
> changes: just minus one barrier and proper handling of PPro errata.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
  2018-02-12 13:58     ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
@ 2018-02-12 14:11       ` Kirill A. Shutemov
  0 siblings, 0 replies; 13+ messages in thread
From: Kirill A. Shutemov @ 2018-02-12 14:11 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin

On Mon, Feb 12, 2018 at 04:58:53PM +0300, Konstantin Khlebnikov wrote:
> THP split makes non-atomic change of tail page flags. This is almost ok
> because tail pages are locked and isolated but this breaks recent changes
> in page locking: non-atomic operation could clear bit PG_waiters.
> 
> As a result concurrent sequence get_page_unless_zero() -> lock_page()
> might block forever. Especially if this page was truncated later.
> 
> Fix is trivial: clone flags before unfreezing page reference counter.
> 
> This race exists since commit 62906027091f ("mm: add PageWaiters indicating
> tasks are waiting for a page bit") while unsave unfreeze itself was added
> in commit 8df651c7059e ("thp: cleanup split_huge_page()").
> 
> clear_compound_head() also must be called before unfreezing page reference
> because after successful get_page_unless_zero() might follow put_page()
> which needs correct compound_head().
> 
> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
> is made especially for that and has semantic of smp_store_release().
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-11 10:35 [PATCH] mm/huge_memory.c: split should clone page flags before unfreezing pageref Konstantin Khlebnikov
2018-02-11 11:07 ` Kirill A. Shutemov
2018-02-11 13:13   ` Konstantin Khlebnikov
2018-02-11 14:29     ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
2018-02-11 15:14       ` Kirill A. Shutemov
2018-02-11 15:32         ` Konstantin Khlebnikov
2018-02-11 15:47           ` Kirill A. Shutemov
2018-02-11 15:55             ` Konstantin Khlebnikov
2018-02-11 20:09       ` Matthew Wilcox
2018-02-12 13:58     ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
2018-02-12 14:07       ` Kirill A. Shutemov
2018-02-12 13:58     ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
2018-02-12 14:11       ` Kirill A. Shutemov

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox