* [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 13:13 ` Konstantin Khlebnikov
@ 2018-02-11 14:29 ` Konstantin Khlebnikov
2018-02-11 15:14 ` Kirill A. Shutemov
2018-02-11 20:09 ` Matthew Wilcox
2018-02-12 13:58 ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
2018-02-12 13:58 ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
From: Konstantin Khlebnikov @ 2018-02-11 14:29 UTC (permalink / raw)
To: linux-mm, Andrew Morton, linux-kernel
Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
THP split makes a non-atomic change of tail page flags. This is almost ok
because tail pages are locked and isolated, but it breaks recent changes
in page locking: a non-atomic operation could clear the PG_waiters bit.

As a result the concurrent sequence get_page_unless_zero() -> lock_page()
might block forever, especially if this page was truncated later.
The fix is trivial: clone the flags before unfreezing the page reference
counter.

This race has existed since commit 62906027091f ("mm: add PageWaiters indicating
tasks are waiting for a page bit"), while the unsafe unfreeze itself was added
in commit 8df651c7059e ("thp: cleanup split_huge_page()").

clear_compound_head() also must be called before unfreezing the page reference,
because a successful get_page_unless_zero() must stabilize compound_head().

Also replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(), which
was made especially for this and has smp_store_release() semantics.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
mm/huge_memory.c | 34 +++++++++++++---------------------
1 file changed, 13 insertions(+), 21 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87ab9b8f56b5..fa577aa7ecd8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2355,26 +2355,11 @@ static void __split_huge_page_tail(struct page *head, int tail,
struct page *page_tail = head + tail;
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
- VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
/*
- * tail_page->_refcount is zero and not changing from under us. But
- * get_page_unless_zero() may be running from under us on the
- * tail_page. If we used atomic_set() below instead of atomic_inc() or
- * atomic_add(), we would then run atomic_set() concurrently with
- * get_page_unless_zero(), and atomic_set() is implemented in C not
- * using locked ops. spin_unlock on x86 sometime uses locked ops
- * because of PPro errata 66, 92, so unless somebody can guarantee
- * atomic_set() here would be safe on all archs (and not only on x86),
- * it's safer to use atomic_inc()/atomic_add().
+ * Clone page flags before unfreezing refcount.
+ * lock_page() after speculative get wants to set PG_waiters.
*/
- if (PageAnon(head) && !PageSwapCache(head)) {
- page_ref_inc(page_tail);
- } else {
- /* Additional pin to radix tree */
- page_ref_add(page_tail, 2);
- }
-
page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
((1L << PG_referenced) |
@@ -2387,14 +2372,21 @@ static void __split_huge_page_tail(struct page *head, int tail,
(1L << PG_unevictable) |
(1L << PG_dirty)));
- /*
- * After clearing PageTail the gup refcount can be released.
- * Page flags also must be visible before we make the page non-compound.
- */
+ /* Page flags must be visible before we make the page non-compound. */
smp_wmb();
+ /*
+ * Clear PageTail before unfreezing page refcount:
+ * speculative refcount must stabilize compound_head().
+ */
clear_compound_head(page_tail);
+ /*
+ * Finally unfreeze refcount. Additional pin to radix tree.
+ */
+ page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+ PageSwapCache(head)));
+
if (page_is_young(head))
set_page_young(page_tail);
if (page_is_idle(head))
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 14:29 ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
@ 2018-02-11 15:14 ` Kirill A. Shutemov
2018-02-11 15:32 ` Konstantin Khlebnikov
2018-02-11 20:09 ` Matthew Wilcox
From: Kirill A. Shutemov @ 2018-02-11 15:14 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
> is made especially for that and has semantic of smp_store_release().
Nak on this part.
page_ref_unfreeze() uses atomic_set(), which neglects the situation described
in the comment you're removing.

You need to at least explain why it's safe now.

I would rather keep page_ref_inc()/page_ref_add() plus an explicit
smp_mb__before_atomic().
--
Kirill A. Shutemov
* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 15:14 ` Kirill A. Shutemov
@ 2018-02-11 15:32 ` Konstantin Khlebnikov
2018-02-11 15:47 ` Kirill A. Shutemov
From: Konstantin Khlebnikov @ 2018-02-11 15:32 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Konstantin Khlebnikov, linux-mm, Andrew Morton,
Linux Kernel Mailing List, Michal Hocko, Linus Torvalds,
Kirill A. Shutemov, Nicholas Piggin
On Sun, Feb 11, 2018 at 6:14 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
>> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
>> is made especially for that and has semantic of smp_store_release().
>
> Nak on this part.
>
> page_ref_unfreeze() uses atomic_set() which neglects the situation in the
> comment you're removing.
Why? Look into the x86 smp_store_release():
for PPro it uses the same sequence, smp_wmb() + WRITE_ONCE().
As far as I see, spin_unlock() uses exactly this macro.

Anyway, if page_ref_unfreeze() cannot handle races with
get_page_unless_zero() then it is completely useless.
>
> You need at least explain why it's safe now.
>
> I would rather leave page_ref_inc()/page_ref_add() + explicit
> smp_mb__before_atomic().
>
> --
> Kirill A. Shutemov
>
* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 15:32 ` Konstantin Khlebnikov
@ 2018-02-11 15:47 ` Kirill A. Shutemov
2018-02-11 15:55 ` Konstantin Khlebnikov
From: Kirill A. Shutemov @ 2018-02-11 15:47 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: Konstantin Khlebnikov, linux-mm, Andrew Morton,
Linux Kernel Mailing List, Michal Hocko, Linus Torvalds,
Kirill A. Shutemov, Nicholas Piggin
On Sun, Feb 11, 2018 at 06:32:10PM +0300, Konstantin Khlebnikov wrote:
> On Sun, Feb 11, 2018 at 6:14 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
> >> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
> >> is made especially for that and has semantic of smp_store_release().
> >
> > Nak on this part.
> >
> > page_ref_unfreeze() uses atomic_set() which neglects the situation in the
> > comment you're removing.
>
> Why? Look into the x86 smp_store_release():
> for PPro it uses the same sequence, smp_wmb() + WRITE_ONCE().
> As far as I see, spin_unlock() uses exactly this macro.
>
> Anyway, if page_ref_unfreeze() cannot handle races with
> get_page_unless_zero() then it is completely useless.
Okay, fair enough.
But please call it out in the commit message.
--
Kirill A. Shutemov
* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 15:47 ` Kirill A. Shutemov
@ 2018-02-11 15:55 ` Konstantin Khlebnikov
From: Konstantin Khlebnikov @ 2018-02-11 15:55 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Konstantin Khlebnikov, linux-mm, Andrew Morton,
Linux Kernel Mailing List, Michal Hocko, Linus Torvalds,
Kirill A. Shutemov, Nicholas Piggin
On Sun, Feb 11, 2018 at 6:47 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Feb 11, 2018 at 06:32:10PM +0300, Konstantin Khlebnikov wrote:
>> On Sun, Feb 11, 2018 at 6:14 PM, Kirill A. Shutemov
>> <kirill@shutemov.name> wrote:
>> > On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
>> >> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
>> >> is made especially for that and has semantic of smp_store_release().
>> >
>> > Nak on this part.
>> >
>> > page_ref_unfreeze() uses atomic_set() which neglects the situation in the
>> > comment you're removing.
>>
>> Why? Look into the x86 smp_store_release():
>> for PPro it uses the same sequence, smp_wmb() + WRITE_ONCE().
>> As far as I see, spin_unlock() uses exactly this macro.
>>
>> Anyway, if page_ref_unfreeze() cannot handle races with
>> get_page_unless_zero() then it is completely useless.
>
> Okay, fair enough.
>
> But please call it out in the commit message.
OK, I'll review this yet again and resend tomorrow.
* Re: [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 14:29 ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
2018-02-11 15:14 ` Kirill A. Shutemov
@ 2018-02-11 20:09 ` Matthew Wilcox
From: Matthew Wilcox @ 2018-02-11 20:09 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
On Sun, Feb 11, 2018 at 05:29:37PM +0300, Konstantin Khlebnikov wrote:
> + /*
> + * Finally unfreeze refcount. Additional pin to radix tree.
> + */
> + page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
> + PageSwapCache(head)));
Please say "Additional pin from page cache", not "radix tree".
* [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze
2018-02-11 13:13 ` Konstantin Khlebnikov
2018-02-11 14:29 ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
@ 2018-02-12 13:58 ` Konstantin Khlebnikov
2018-02-12 14:07 ` Kirill A. Shutemov
2018-02-12 13:58 ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
From: Konstantin Khlebnikov @ 2018-02-12 13:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton, linux-kernel
Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
page_ref_unfreeze() wants exactly the semantics of a release store, so use
atomic_set_release() instead of the open-coded smp_mb() + atomic_set().
No functional change: just one barrier fewer and proper handling of the
PPro errata.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
include/linux/page_ref.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 760d74a0e9a9..14d14beb1f7f 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -175,8 +175,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
VM_BUG_ON_PAGE(page_count(page) != 0, page);
VM_BUG_ON(count == 0);
- smp_mb();
- atomic_set(&page->_refcount, count);
+ atomic_set_release(&page->_refcount, count);
if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
__page_ref_unfreeze(page, count);
}
* Re: [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze
2018-02-12 13:58 ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
@ 2018-02-12 14:07 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2018-02-12 14:07 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
On Mon, Feb 12, 2018 at 04:58:50PM +0300, Konstantin Khlebnikov wrote:
> page_ref_unfreeze() has exactly that semantic. No functional
> changes: just minus one barrier and proper handling of PPro errata.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-11 13:13 ` Konstantin Khlebnikov
2018-02-11 14:29 ` [PATCH v2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
2018-02-12 13:58 ` [PATCH v3 1/2] mm/page_ref: use atomic_set_release in page_ref_unfreeze Konstantin Khlebnikov
@ 2018-02-12 13:58 ` Konstantin Khlebnikov
2018-02-12 14:11 ` Kirill A. Shutemov
From: Konstantin Khlebnikov @ 2018-02-12 13:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton, linux-kernel
Cc: Michal Hocko, Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
THP split makes a non-atomic change of tail page flags. This is almost ok
because tail pages are locked and isolated, but it breaks recent changes
in page locking: a non-atomic operation could clear the PG_waiters bit.

As a result the concurrent sequence get_page_unless_zero() -> lock_page()
might block forever, especially if this page was truncated later.
The fix is trivial: clone the flags before unfreezing the page reference
counter.

This race has existed since commit 62906027091f ("mm: add PageWaiters indicating
tasks are waiting for a page bit"), while the unsafe unfreeze itself was added
in commit 8df651c7059e ("thp: cleanup split_huge_page()").

clear_compound_head() also must be called before unfreezing the page reference,
because a successful get_page_unless_zero() might be followed by put_page(),
which needs the correct compound_head().
Also replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(), which
was made especially for this and has smp_store_release() semantics.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
mm/huge_memory.c | 36 +++++++++++++++---------------------
1 file changed, 15 insertions(+), 21 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87ab9b8f56b5..7a8ead256b2f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2355,26 +2355,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
struct page *page_tail = head + tail;
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
- VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
/*
- * tail_page->_refcount is zero and not changing from under us. But
- * get_page_unless_zero() may be running from under us on the
- * tail_page. If we used atomic_set() below instead of atomic_inc() or
- * atomic_add(), we would then run atomic_set() concurrently with
- * get_page_unless_zero(), and atomic_set() is implemented in C not
- * using locked ops. spin_unlock on x86 sometime uses locked ops
- * because of PPro errata 66, 92, so unless somebody can guarantee
- * atomic_set() here would be safe on all archs (and not only on x86),
- * it's safer to use atomic_inc()/atomic_add().
+ * Clone page flags before unfreezing refcount.
+ *
+ * After a successful get_page_unless_zero() a flags change might follow,
+ * for example lock_page() which sets PG_waiters.
*/
- if (PageAnon(head) && !PageSwapCache(head)) {
- page_ref_inc(page_tail);
- } else {
- /* Additional pin to radix tree */
- page_ref_add(page_tail, 2);
- }
-
page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
((1L << PG_referenced) |
@@ -2387,14 +2374,21 @@ static void __split_huge_page_tail(struct page *head, int tail,
(1L << PG_unevictable) |
(1L << PG_dirty)));
- /*
- * After clearing PageTail the gup refcount can be released.
- * Page flags also must be visible before we make the page non-compound.
- */
+ /* Page flags must be visible before we make the page non-compound. */
smp_wmb();
+ /*
+ * Clear PageTail before unfreezing page refcount.
+ *
+ * After successful get_page_unless_zero() might follow put_page()
+ * which needs correct compound_head().
+ */
clear_compound_head(page_tail);
+ /* Finally unfreeze refcount. Additional reference from page cache. */
+ page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+ PageSwapCache(head)));
+
if (page_is_young(head))
set_page_young(page_tail);
if (page_is_idle(head))
* Re: [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail()
2018-02-12 13:58 ` [PATCH v3 2/2] mm/huge_memory.c: reorder operations in __split_huge_page_tail() Konstantin Khlebnikov
@ 2018-02-12 14:11 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2018-02-12 14:11 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
Linus Torvalds, Kirill A. Shutemov, Nicholas Piggin
On Mon, Feb 12, 2018 at 04:58:53PM +0300, Konstantin Khlebnikov wrote:
> THP split makes non-atomic change of tail page flags. This is almost ok
> because tail pages are locked and isolated but this breaks recent changes
> in page locking: non-atomic operation could clear bit PG_waiters.
>
> As a result concurrent sequence get_page_unless_zero() -> lock_page()
> might block forever. Especially if this page was truncated later.
>
> Fix is trivial: clone flags before unfreezing page reference counter.
>
> This race exists since commit 62906027091f ("mm: add PageWaiters indicating
> tasks are waiting for a page bit") while unsafe unfreeze itself was added
> in commit 8df651c7059e ("thp: cleanup split_huge_page()").
>
> clear_compound_head() also must be called before unfreezing page reference
> because after successful get_page_unless_zero() might follow put_page()
> which needs correct compound_head().
>
> And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
> is made especially for that and has semantic of smp_store_release().
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kirill A. Shutemov