[RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-03  7:52 ` Huang, Ying
  0 siblings, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2017-11-03  7:52 UTC (permalink / raw)
  To: Naoya Horiguchi, Zi Yan
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

From: Huang Ying <ying.huang@intel.com>

If THP migration is enabled, the following situation is possible,

- A THP is mapped at source address
- Migration is started to move the THP to another node
- Page fault occurs
- The PMD (migration entry) is copied to the destination address in mremap

That is, it is possible for handle_userfault() encounter a PMD entry
which has been handled but !pmd_present().  In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.

This is fixed via avoiding to wait when !pmd_present(), only wait when
pmd_none().

Question:

I found userfaultfd_must_wait() is always called when PMD or PTE is
none, and with mm->mmap_sem read-lock held.  mremap() will write-lock
mm->mmap_sem.  And UFFDIO_COPY don't support to copy THP mapping.  So
the situation described above couldn't happen in practice?

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.UK>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/userfaultfd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b5a0193e1960..0fcf66c3e439 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -294,10 +294,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 * pmd_trans_unstable) of the pmd.
 	 */
 	_pmd = READ_ONCE(*pmd);
-	if (!pmd_present(_pmd))
+	if (pmd_none(_pmd))
 		goto out;
 
 	ret = false;
+	if (!pmd_present(_pmd))
+		goto out;
+
 	if (pmd_trans_huge(_pmd))
 		goto out;
 
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-03  7:52 ` Huang, Ying
  0 siblings, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2017-11-03  7:52 UTC (permalink / raw)
  To: Naoya Horiguchi, Zi Yan
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

From: Huang Ying <ying.huang@intel.com>

If THP migration is enabled, the following situation is possible,

- A THP is mapped at source address
- Migration is started to move the THP to another node
- Page fault occurs
- The PMD (migration entry) is copied to the destination address in mremap

That is, it is possible for handle_userfault() encounter a PMD entry
which has been handled but !pmd_present().  In the current
implementation, we will wait for such PMD entries, which may cause
unnecessary waiting, and potential soft lockup.

This is fixed via avoiding to wait when !pmd_present(), only wait when
pmd_none().

Question:

I found userfaultfd_must_wait() is always called when PMD or PTE is
none, and with mm->mmap_sem read-lock held.  mremap() will write-lock
mm->mmap_sem.  And UFFDIO_COPY don't support to copy THP mapping.  So
the situation described above couldn't happen in practice?

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.UK>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/userfaultfd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b5a0193e1960..0fcf66c3e439 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -294,10 +294,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 * pmd_trans_unstable) of the pmd.
 	 */
 	_pmd = READ_ONCE(*pmd);
-	if (!pmd_present(_pmd))
+	if (pmd_none(_pmd))
 		goto out;
 
 	ret = false;
+	if (!pmd_present(_pmd))
+		goto out;
+
 	if (pmd_trans_huge(_pmd))
 		goto out;
 
-- 
2.14.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-03  7:52 ` Huang, Ying
  (?)
@ 2017-11-03 15:00 ` Zi Yan
  2017-11-05  3:01     ` huang ying
  -1 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2017-11-03 15:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Naoya Horiguchi, linux-mm, linux-kernel, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

[-- Attachment #1: Type: text/plain, Size: 1117 bytes --]

On 3 Nov 2017, at 3:52, Huang, Ying wrote:

> From: Huang Ying <ying.huang@intel.com>
>
> If THP migration is enabled, the following situation is possible,
>
> - A THP is mapped at source address
> - Migration is started to move the THP to another node
> - Page fault occurs
> - The PMD (migration entry) is copied to the destination address in mremap
>

You mean the page fault path follows the source address and sees pmd_none() now
because mremap() clears it and remaps the page with dest address.
Otherwise, it seems not possible to get into handle_userfault(), since it is called in
pmd_none() branch inside do_huge_pmd_anonymous_page().


> That is, it is possible for handle_userfault() encounter a PMD entry
> which has been handled but !pmd_present().  In the current
> implementation, we will wait for such PMD entries, which may cause
> unnecessary waiting, and potential soft lockup.

handle_userfault() should only see pmd_none() in the situation you describe,
whereas !pmd_present() (migration entry case) should lead to
pmd_migration_entry_wait().

Am I missing anything here?


--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-03 15:00 ` Zi Yan
@ 2017-11-05  3:01     ` huang ying
  0 siblings, 0 replies; 14+ messages in thread
From: huang ying @ 2017-11-05  3:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: Huang, Ying, Naoya Horiguchi, linux-mm, LKML, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
> On 3 Nov 2017, at 3:52, Huang, Ying wrote:
>
>> From: Huang Ying <ying.huang@intel.com>
>>
>> If THP migration is enabled, the following situation is possible,
>>
>> - A THP is mapped at source address
>> - Migration is started to move the THP to another node
>> - Page fault occurs
>> - The PMD (migration entry) is copied to the destination address in mremap
>>
>
> You mean the page fault path follows the source address and sees pmd_none() now
> because mremap() clears it and remaps the page with dest address.
> Otherwise, it seems not possible to get into handle_userfault(), since it is called in
> pmd_none() branch inside do_huge_pmd_anonymous_page().
>
>
>> That is, it is possible for handle_userfault() encounter a PMD entry
>> which has been handled but !pmd_present().  In the current
>> implementation, we will wait for such PMD entries, which may cause
>> unnecessary waiting, and potential soft lockup.
>
> handle_userfault() should only see pmd_none() in the situation you describe,
> whereas !pmd_present() (migration entry case) should lead to
> pmd_migration_entry_wait().

Yes.  This is my understanding of the source code too.  And I
described it in the original patch description too.  I just want to
make sure whether it is possible that !pmd_none() and !pmd_present()
for a PMD in userfaultfd_must_wait().  And, whether it is possible for
us to implement PMD mapping copying in UFFDIO_COPY in the future?

Best Regards,
Huang, Ying

> Am I missing anything here?
>
>
> --
> Best Regards
> Yan Zi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-05  3:01     ` huang ying
  0 siblings, 0 replies; 14+ messages in thread
From: huang ying @ 2017-11-05  3:01 UTC (permalink / raw)
  To: Zi Yan
  Cc: Huang, Ying, Naoya Horiguchi, linux-mm, LKML, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
> On 3 Nov 2017, at 3:52, Huang, Ying wrote:
>
>> From: Huang Ying <ying.huang@intel.com>
>>
>> If THP migration is enabled, the following situation is possible,
>>
>> - A THP is mapped at source address
>> - Migration is started to move the THP to another node
>> - Page fault occurs
>> - The PMD (migration entry) is copied to the destination address in mremap
>>
>
> You mean the page fault path follows the source address and sees pmd_none() now
> because mremap() clears it and remaps the page with dest address.
> Otherwise, it seems not possible to get into handle_userfault(), since it is called in
> pmd_none() branch inside do_huge_pmd_anonymous_page().
>
>
>> That is, it is possible for handle_userfault() encounter a PMD entry
>> which has been handled but !pmd_present().  In the current
>> implementation, we will wait for such PMD entries, which may cause
>> unnecessary waiting, and potential soft lockup.
>
> handle_userfault() should only see pmd_none() in the situation you describe,
> whereas !pmd_present() (migration entry case) should lead to
> pmd_migration_entry_wait().

Yes.  This is my understanding of the source code too.  And I
described it in the original patch description too.  I just want to
make sure whether it is possible that !pmd_none() and !pmd_present()
for a PMD in userfaultfd_must_wait().  And, whether it is possible for
us to implement PMD mapping copying in UFFDIO_COPY in the future?

Best Regards,
Huang, Ying

> Am I missing anything here?
>
>
> --
> Best Regards
> Yan Zi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-05  3:01     ` huang ying
@ 2017-11-06 15:53       ` Zi Yan
  -1 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2017-11-06 15:53 UTC (permalink / raw)
  To: huang ying
  Cc: Huang, Ying, Naoya Horiguchi, linux-mm, LKML, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

On 4 Nov 2017, at 23:01, huang ying wrote:

> On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
>> On 3 Nov 2017, at 3:52, Huang, Ying wrote:
>>
>>> From: Huang Ying <ying.huang@intel.com>
>>>
>>> If THP migration is enabled, the following situation is possible,
>>>
>>> - A THP is mapped at source address
>>> - Migration is started to move the THP to another node
>>> - Page fault occurs
>>> - The PMD (migration entry) is copied to the destination address in 
>>> mremap
>>>
>>
>> You mean the page fault path follows the source address and sees 
>> pmd_none() now
>> because mremap() clears it and remaps the page with dest address.
>> Otherwise, it seems not possible to get into handle_userfault(), 
>> since it is called in
>> pmd_none() branch inside do_huge_pmd_anonymous_page().
>>
>>
>>> That is, it is possible for handle_userfault() encounter a PMD entry
>>> which has been handled but !pmd_present().  In the current
>>> implementation, we will wait for such PMD entries, which may cause
>>> unnecessary waiting, and potential soft lockup.
>>
>> handle_userfault() should only see pmd_none() in the situation you 
>> describe,
>> whereas !pmd_present() (migration entry case) should lead to
>> pmd_migration_entry_wait().
>
> Yes.  This is my understanding of the source code too.  And I
> described it in the original patch description too.  I just want to
> make sure whether it is possible that !pmd_none() and !pmd_present()
> for a PMD in userfaultfd_must_wait().  And, whether it is possible for
> us to implement PMD mapping copying in UFFDIO_COPY in the future?
>

Thanks for clarifying it. We both agree that !pmd_present(), which means
PMD migration entry, does not get into userfaultfd_must_wait(),
then there seems to be no issue with current code yet.

However, the if (!pmd_present(_pmd)) in userfaultfd_must_wait() does not 
match
the exact condition. How about the patch below? It can catch pmd 
migration entries,
which are only possible in x86_64 at the moment.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c713fd5b3e6..dda25444a6ee 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -294,9 +294,11 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
          * pmd_trans_unstable) of the pmd.
          */
         _pmd = READ_ONCE(*pmd);
-       if (!pmd_present(_pmd))
+       if (pmd_none(_pmd))
                 goto out;

+       VM_BUG_ON(thp_migration_supported() && 
is_pmd_migration_entry(_pmd));
+
         ret = false;
         if (pmd_trans_huge(_pmd))
                 goto out;



—
Best Regards,
Yan Zi

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-06 15:53       ` Zi Yan
  0 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2017-11-06 15:53 UTC (permalink / raw)
  To: huang ying
  Cc: Huang, Ying, Naoya Horiguchi, linux-mm, LKML, Andrea Arcangeli,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

On 4 Nov 2017, at 23:01, huang ying wrote:

> On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
>> On 3 Nov 2017, at 3:52, Huang, Ying wrote:
>>
>>> From: Huang Ying <ying.huang@intel.com>
>>>
>>> If THP migration is enabled, the following situation is possible,
>>>
>>> - A THP is mapped at source address
>>> - Migration is started to move the THP to another node
>>> - Page fault occurs
>>> - The PMD (migration entry) is copied to the destination address in 
>>> mremap
>>>
>>
>> You mean the page fault path follows the source address and sees 
>> pmd_none() now
>> because mremap() clears it and remaps the page with dest address.
>> Otherwise, it seems not possible to get into handle_userfault(), 
>> since it is called in
>> pmd_none() branch inside do_huge_pmd_anonymous_page().
>>
>>
>>> That is, it is possible for handle_userfault() encounter a PMD entry
>>> which has been handled but !pmd_present().  In the current
>>> implementation, we will wait for such PMD entries, which may cause
>>> unnecessary waiting, and potential soft lockup.
>>
>> handle_userfault() should only see pmd_none() in the situation you 
>> describe,
>> whereas !pmd_present() (migration entry case) should lead to
>> pmd_migration_entry_wait().
>
> Yes.  This is my understanding of the source code too.  And I
> described it in the original patch description too.  I just want to
> make sure whether it is possible that !pmd_none() and !pmd_present()
> for a PMD in userfaultfd_must_wait().  And, whether it is possible for
> us to implement PMD mapping copying in UFFDIO_COPY in the future?
>

Thanks for clarifying it. We both agree that !pmd_present(), which means
PMD migration entry, does not get into userfaultfd_must_wait(),
then there seems to be no issue with current code yet.

However, the if (!pmd_present(_pmd)) in userfaultfd_must_wait() does not 
match
the exact condition. How about the patch below? It can catch pmd 
migration entries,
which are only possible in x86_64 at the moment.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c713fd5b3e6..dda25444a6ee 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -294,9 +294,11 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
          * pmd_trans_unstable) of the pmd.
          */
         _pmd = READ_ONCE(*pmd);
-       if (!pmd_present(_pmd))
+       if (pmd_none(_pmd))
                 goto out;

+       VM_BUG_ON(thp_migration_supported() && 
is_pmd_migration_entry(_pmd));
+
         ret = false;
         if (pmd_trans_huge(_pmd))
                 goto out;



a??
Best Regards,
Yan Zi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-05  3:01     ` huang ying
@ 2017-11-06 20:21       ` Andrea Arcangeli
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2017-11-06 20:21 UTC (permalink / raw)
  To: huang ying
  Cc: Zi Yan, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

Hello,

On Sun, Nov 05, 2017 at 11:01:05AM +0800, huang ying wrote:
> On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
> > On 3 Nov 2017, at 3:52, Huang, Ying wrote:
> >
> >> From: Huang Ying <ying.huang@intel.com>
> >>
> >> If THP migration is enabled, the following situation is possible,
> >>
> >> - A THP is mapped at source address
> >> - Migration is started to move the THP to another node
> >> - Page fault occurs
> >> - The PMD (migration entry) is copied to the destination address in mremap
> >>
> >
> > You mean the page fault path follows the source address and sees pmd_none() now
> > because mremap() clears it and remaps the page with dest address.
> > Otherwise, it seems not possible to get into handle_userfault(), since it is called in
> > pmd_none() branch inside do_huge_pmd_anonymous_page().
> >
> >
> >> That is, it is possible for handle_userfault() encounter a PMD entry
> >> which has been handled but !pmd_present().  In the current
> >> implementation, we will wait for such PMD entries, which may cause
> >> unnecessary waiting, and potential soft lockup.
> >
> > handle_userfault() should only see pmd_none() in the situation you describe,
> > whereas !pmd_present() (migration entry case) should lead to
> > pmd_migration_entry_wait().
> 
> Yes.  This is my understanding of the source code too.  And I
> described it in the original patch description too.  I just want to
> make sure whether it is possible that !pmd_none() and !pmd_present()
> for a PMD in userfaultfd_must_wait().  And, whether it is possible for

I don't see how mremap is relevant above. mremap runs with mmap_sem
for writing, so it can't race against userfaultfd_must_wait.

However the concern of set_pmd_migration_entry() being called with
only the mmap_sem for reading through TTU_MIGRATION in
__unmap_and_move and being interpreted as a "missing" THP page by
userfaultfd_must_wait seems valid.

Compaction won't normally compact pages that are already THP sized so
you cannot see this normally because VM don't normally get migrated
over SHM/hugetlbfs with hard bindings while userfaults are in
progress.

Overall your patch looks more correct than current code so it's good
idea to apply and it should avoid surprises with the above corner
case if CONFIG_ARCH_ENABLE_THP_MIGRATION is set.

Worst case the process would hang in handle_userfault(), but it will
still respond fine to sigkill, so it's not concerning, but it should
be fixed nevertheless.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

> us to implement PMD mapping copying in UFFDIO_COPY in the future?

That's definitely good idea to add too. We don't have an userland
model for THP yet in QEMU (so it wouldn't be making a difference right
now), we have it for the hugetlbfs-only case though. It'd be nice to
add a THP model and to have an option to do the faults at 2M
granularity also on anon and SHM memory (not just with hugetlbfs).

With userfaults the granularity of the fault is entirely decided by
userland. The kernel can then map a THP directly into the destination
if the granularity userland uses is 2M. The 8k user fault granularity
would also be feasible on x86, but it won't provide any TLB benefits,
while the 2M granularity will (after the kernel optimization you're
asking about). So it should be an ideal faetu.

I tried to defer the complexity to the point it could provide a
runtime payoff and until we tested userfaults at 2M granularity we
wouldn't know for sure how it would behave. Now we run userfaults on
hugetlbfs in production and so by now know the latency of those 2M
transfers over network is acceptable and the live migration runs
slightly faster overall. All goes as expected at runtime, so in
principle the THP model with anon/SHM THP should be a good tradeoff
too. Note that it will only work well with the fastest network
bandwidth available. Legacy gigabit likely wants to stay at current 4k
granularity so the default should probably stick to 4k userfault
granularity to avoid having to deal with unexpected higher latencies.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-06 20:21       ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2017-11-06 20:21 UTC (permalink / raw)
  To: huang ying
  Cc: Zi Yan, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

Hello,

On Sun, Nov 05, 2017 at 11:01:05AM +0800, huang ying wrote:
> On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
> > On 3 Nov 2017, at 3:52, Huang, Ying wrote:
> >
> >> From: Huang Ying <ying.huang@intel.com>
> >>
> >> If THP migration is enabled, the following situation is possible,
> >>
> >> - A THP is mapped at source address
> >> - Migration is started to move the THP to another node
> >> - Page fault occurs
> >> - The PMD (migration entry) is copied to the destination address in mremap
> >>
> >
> > You mean the page fault path follows the source address and sees pmd_none() now
> > because mremap() clears it and remaps the page with dest address.
> > Otherwise, it seems not possible to get into handle_userfault(), since it is called in
> > pmd_none() branch inside do_huge_pmd_anonymous_page().
> >
> >
> >> That is, it is possible for handle_userfault() encounter a PMD entry
> >> which has been handled but !pmd_present().  In the current
> >> implementation, we will wait for such PMD entries, which may cause
> >> unnecessary waiting, and potential soft lockup.
> >
> > handle_userfault() should only see pmd_none() in the situation you describe,
> > whereas !pmd_present() (migration entry case) should lead to
> > pmd_migration_entry_wait().
> 
> Yes.  This is my understanding of the source code too.  And I
> described it in the original patch description too.  I just want to
> make sure whether it is possible that !pmd_none() and !pmd_present()
> for a PMD in userfaultfd_must_wait().  And, whether it is possible for

I don't see how mremap is relevant above. mremap runs with mmap_sem
for writing, so it can't race against userfaultfd_must_wait.

However the concern of set_pmd_migration_entry() being called with
only the mmap_sem for reading through TTU_MIGRATION in
__unmap_and_move and being interpreted as a "missing" THP page by
userfaultfd_must_wait seems valid.

Compaction won't normally compact pages that are already THP sized so
you cannot see this normally because VM don't normally get migrated
over SHM/hugetlbfs with hard bindings while userfaults are in
progress.

Overall your patch looks more correct than current code so it's good
idea to apply and it should avoid surprises with the above corner
case if CONFIG_ARCH_ENABLE_THP_MIGRATION is set.

Worst case the process would hang in handle_userfault(), but it will
still respond fine to sigkill, so it's not concerning, but it should
be fixed nevertheless.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

> us to implement PMD mapping copying in UFFDIO_COPY in the future?

That's definitely good idea to add too. We don't have an userland
model for THP yet in QEMU (so it wouldn't be making a difference right
now), we have it for the hugetlbfs-only case though. It'd be nice to
add a THP model and to have an option to do the faults at 2M
granularity also on anon and SHM memory (not just with hugetlbfs).

With userfaults the granularity of the fault is entirely decided by
userland. The kernel can then map a THP directly into the destination
if the granularity userland uses is 2M. The 8k user fault granularity
would also be feasible on x86, but it won't provide any TLB benefits,
while the 2M granularity will (after the kernel optimization you're
asking about). So it should be an ideal faetu.

I tried to defer the complexity to the point it could provide a
runtime payoff and until we tested userfaults at 2M granularity we
wouldn't know for sure how it would behave. Now we run userfaults on
hugetlbfs in production and so by now know the latency of those 2M
transfers over network is acceptable and the live migration runs
slightly faster overall. All goes as expected at runtime, so in
principle the THP model with anon/SHM THP should be a good tradeoff
too. Note that it will only work well with the fastest network
bandwidth available. Legacy gigabit likely wants to stay at current 4k
granularity so the default should probably stick to 4k userfault
granularity to avoid having to deal with unexpected higher latencies.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-06 15:53       ` Zi Yan
@ 2017-11-06 20:35         ` Andrea Arcangeli
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2017-11-06 20:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: huang ying, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

On Mon, Nov 06, 2017 at 10:53:48AM -0500, Zi Yan wrote:
> Thanks for clarifying it. We both agree that !pmd_present(), which means
> PMD migration entry, does not get into userfaultfd_must_wait(),
> then there seems to be no issue with current code yet.
> 
> However, the if (!pmd_present(_pmd)) in userfaultfd_must_wait() does not 
> match
> the exact condition. How about the patch below? It can catch pmd 
> migration entries,
> which are only possible in x86_64 at the moment.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 1c713fd5b3e6..dda25444a6ee 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -294,9 +294,11 @@ static inline bool userfaultfd_must_wait(struct 
> userfaultfd_ctx *ctx,
>           * pmd_trans_unstable) of the pmd.
>           */
>          _pmd = READ_ONCE(*pmd);
> -       if (!pmd_present(_pmd))
> +       if (pmd_none(_pmd))
>                  goto out;
> 
> +       VM_BUG_ON(thp_migration_supported() && is_pmd_migration_entry(_pmd));
> +

As I wrote in prev email I'm not sure about this invariant to be
correct 100% of the time (plus we'd want a VM_WARN_ON only
here). Specifically, what does prevent try_to_unmap to run on a THP
backed mapping with only the mmap_sem for reading?

I know what prevents to ever reproduce this in practice though (aside
from the fact the race between the is_swap_pmd() check in the main
page fault and the above check is small) and it's because compaction
won't migrate THP and even the numa faults will not use the migration
entry. So it'd require some more explicit migration numactl while
userfaults are running to ever see an hang in there.

I think it's a regression since the introduction of THP migration
around commits 84c3fc4e9c563d8fb91cfdf5948da48fe1af34d3 /
616b8371539a6c487404c3b8fb04078016dab4ba /
9c670ea37947a82cb6d4df69139f7e46ed71a0ac etc.. before that pmd_none or
!pmd_present used to be equivalent, not the case any longer. Of course
pmd_none would have been better before too.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-06 20:35         ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2017-11-06 20:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: huang ying, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

On Mon, Nov 06, 2017 at 10:53:48AM -0500, Zi Yan wrote:
> Thanks for clarifying it. We both agree that !pmd_present(), which means
> PMD migration entry, does not get into userfaultfd_must_wait(),
> then there seems to be no issue with current code yet.
> 
> However, the if (!pmd_present(_pmd)) in userfaultfd_must_wait() does not 
> match
> the exact condition. How about the patch below? It can catch pmd 
> migration entries,
> which are only possible in x86_64 at the moment.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 1c713fd5b3e6..dda25444a6ee 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -294,9 +294,11 @@ static inline bool userfaultfd_must_wait(struct 
> userfaultfd_ctx *ctx,
>           * pmd_trans_unstable) of the pmd.
>           */
>          _pmd = READ_ONCE(*pmd);
> -       if (!pmd_present(_pmd))
> +       if (pmd_none(_pmd))
>                  goto out;
> 
> +       VM_BUG_ON(thp_migration_supported() && is_pmd_migration_entry(_pmd));
> +

As I wrote in prev email I'm not sure about this invariant to be
correct 100% of the time (plus we'd want a VM_WARN_ON only
here). Specifically, what does prevent try_to_unmap to run on a THP
backed mapping with only the mmap_sem for reading?

I know what prevents to ever reproduce this in practice though (aside
from the fact the race between the is_swap_pmd() check in the main
page fault and the above check is small) and it's because compaction
won't migrate THP and even the numa faults will not use the migration
entry. So it'd require some more explicit migration numactl while
userfaults are running to ever see an hang in there.

I think it's a regression since the introduction of THP migration
around commits 84c3fc4e9c563d8fb91cfdf5948da48fe1af34d3 /
616b8371539a6c487404c3b8fb04078016dab4ba /
9c670ea37947a82cb6d4df69139f7e46ed71a0ac etc.. before that pmd_none or
!pmd_present used to be equivalent, not the case any longer. Of course
pmd_none would have been better before too.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-06 20:35         ` Andrea Arcangeli
  (?)
@ 2017-11-07  2:30         ` Zi Yan
  -1 siblings, 0 replies; 14+ messages in thread
From: Zi Yan @ 2017-11-07  2:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: huang ying, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

[-- Attachment #1: Type: text/plain, Size: 2589 bytes --]

On 6 Nov 2017, at 15:35, Andrea Arcangeli wrote:

> On Mon, Nov 06, 2017 at 10:53:48AM -0500, Zi Yan wrote:
>> Thanks for clarifying it. We both agree that !pmd_present(), which means
>> PMD migration entry, does not get into userfaultfd_must_wait(),
>> then there seems to be no issue with current code yet.
>>
>> However, the if (!pmd_present(_pmd)) in userfaultfd_must_wait() does not
>> match
>> the exact condition. How about the patch below? It can catch pmd
>> migration entries,
>> which are only possible in x86_64 at the moment.
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 1c713fd5b3e6..dda25444a6ee 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -294,9 +294,11 @@ static inline bool userfaultfd_must_wait(struct
>> userfaultfd_ctx *ctx,
>>           * pmd_trans_unstable) of the pmd.
>>           */
>>          _pmd = READ_ONCE(*pmd);
>> -       if (!pmd_present(_pmd))
>> +       if (pmd_none(_pmd))
>>                  goto out;
>>
>> +       VM_BUG_ON(thp_migration_supported() && is_pmd_migration_entry(_pmd));
>> +
>
> As I wrote in prev email I'm not sure about this invariant to be
> correct 100% of the time (plus we'd want a VM_WARN_ON only
> here). Specifically, what does prevent try_to_unmap to run on a THP
> backed mapping with only the mmap_sem for reading?
>

Right. I missed the part that the page table lock is released before
entering handle_userfault(). The pmd_none() can be mapped elsewhere
and migrated, !pmd_present() but not pmd_none() is possible here when
THP migration is enabled.

> I know what prevents to ever reproduce this in practice though (aside
> from the fact the race between the is_swap_pmd() check in the main
> page fault and the above check is small) and it's because compaction
> won't migrate THP and even the numa faults will not use the migration
> entry. So it'd require some more explicit migration numactl while
> userfaults are running to ever see an hang in there.
>
> I think it's a regression since the introduction of THP migration
> around commits 84c3fc4e9c563d8fb91cfdf5948da48fe1af34d3 /
> 616b8371539a6c487404c3b8fb04078016dab4ba /
> 9c670ea37947a82cb6d4df69139f7e46ed71a0ac etc.. before that pmd_none or
> !pmd_present used to be equivalent, not the case any longer. Of course
> pmd_none would have been better before too.
>

Right. Ying’s patch is a fix of the regression.

Fixes: 84c3fc4e9c563 ("mm: thp: check pmd migration entry in common path")

Thanks for pointing all these out.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 557 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
  2017-11-06 20:21       ` Andrea Arcangeli
@ 2017-11-09  7:33         ` Huang, Ying
  -1 siblings, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2017-11-09  7:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: huang ying, Zi Yan, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

Andrea Arcangeli <aarcange@redhat.com> writes:

> Hello,
>
> On Sun, Nov 05, 2017 at 11:01:05AM +0800, huang ying wrote:
>> On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
>> > On 3 Nov 2017, at 3:52, Huang, Ying wrote:
>> >
>> >> From: Huang Ying <ying.huang@intel.com>
>> >>
>> >> If THP migration is enabled, the following situation is possible,
>> >>
>> >> - A THP is mapped at source address
>> >> - Migration is started to move the THP to another node
>> >> - Page fault occurs
>> >> - The PMD (migration entry) is copied to the destination address in mremap
>> >>
>> >
>> > You mean the page fault path follows the source address and sees pmd_none() now
>> > because mremap() clears it and remaps the page with dest address.
>> > Otherwise, it seems not possible to get into handle_userfault(), since it is called in
>> > pmd_none() branch inside do_huge_pmd_anonymous_page().
>> >
>> >
>> >> That is, it is possible for handle_userfault() encounter a PMD entry
>> >> which has been handled but !pmd_present().  In the current
>> >> implementation, we will wait for such PMD entries, which may cause
>> >> unnecessary waiting, and potential soft lockup.
>> >
>> > handle_userfault() should only see pmd_none() in the situation you describe,
>> > whereas !pmd_present() (migration entry case) should lead to
>> > pmd_migration_entry_wait().
>> 
>> Yes.  This is my understanding of the source code too.  And I
>> described it in the original patch description too.  I just want to
>> make sure whether it is possible that !pmd_none() and !pmd_present()
>> for a PMD in userfaultfd_must_wait().  And, whether it is possible for
>
> I don't see how mremap is relevant above. mremap runs with mmap_sem
> for writing, so it can't race against userfaultfd_must_wait.
>
> However the concern of set_pmd_migration_entry() being called with
> only the mmap_sem for reading through TTU_MIGRATION in
> __unmap_and_move and being interpreted as a "missing" THP page by
> userfaultfd_must_wait seems valid.
>
> Compaction won't normally compact pages that are already THP sized so
> you cannot see this normally because VM don't normally get migrated
> over SHM/hugetlbfs with hard bindings while userfaults are in
> progress.
>
> Overall your patch looks more correct than current code so it's good
> idea to apply and it should avoid surprises with the above corner
> case if CONFIG_ARCH_ENABLE_THP_MIGRATION is set.
>
> Worst case the process would hang in handle_userfault(), but it will
> still respond fine to sigkill, so it's not concerning, but it should
> be fixed nevertheless.
>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Thanks!  I will revise the patch description and send the new version!

Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration
@ 2017-11-09  7:33         ` Huang, Ying
  0 siblings, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2017-11-09  7:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: huang ying, Zi Yan, Huang, Ying, Naoya Horiguchi, linux-mm, LKML,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Alexander Viro

Andrea Arcangeli <aarcange@redhat.com> writes:

> Hello,
>
> On Sun, Nov 05, 2017 at 11:01:05AM +0800, huang ying wrote:
>> On Fri, Nov 3, 2017 at 11:00 PM, Zi Yan <zi.yan@cs.rutgers.edu> wrote:
>> > On 3 Nov 2017, at 3:52, Huang, Ying wrote:
>> >
>> >> From: Huang Ying <ying.huang@intel.com>
>> >>
>> >> If THP migration is enabled, the following situation is possible,
>> >>
>> >> - A THP is mapped at source address
>> >> - Migration is started to move the THP to another node
>> >> - Page fault occurs
>> >> - The PMD (migration entry) is copied to the destination address in mremap
>> >>
>> >
>> > You mean the page fault path follows the source address and sees pmd_none() now
>> > because mremap() clears it and remaps the page with dest address.
>> > Otherwise, it seems not possible to get into handle_userfault(), since it is called in
>> > pmd_none() branch inside do_huge_pmd_anonymous_page().
>> >
>> >
>> >> That is, it is possible for handle_userfault() encounter a PMD entry
>> >> which has been handled but !pmd_present().  In the current
>> >> implementation, we will wait for such PMD entries, which may cause
>> >> unnecessary waiting, and potential soft lockup.
>> >
>> > handle_userfault() should only see pmd_none() in the situation you describe,
>> > whereas !pmd_present() (migration entry case) should lead to
>> > pmd_migration_entry_wait().
>> 
>> Yes.  This is my understanding of the source code too.  And I
>> described it in the original patch description too.  I just want to
>> make sure whether it is possible that !pmd_none() and !pmd_present()
>> for a PMD in userfaultfd_must_wait().  And, whether it is possible for
>
> I don't see how mremap is relevant above. mremap runs with mmap_sem
> for writing, so it can't race against userfaultfd_must_wait.
>
> However the concern of set_pmd_migration_entry() being called with
> only the mmap_sem for reading through TTU_MIGRATION in
> __unmap_and_move and being interpreted as a "missing" THP page by
> userfaultfd_must_wait seems valid.
>
> Compaction won't normally compact pages that are already THP sized so
> you cannot see this normally because VM don't normally get migrated
> over SHM/hugetlbfs with hard bindings while userfaults are in
> progress.
>
> Overall your patch looks more correct than current code so it's good
> idea to apply and it should avoid surprises with the above corner
> case if CONFIG_ARCH_ENABLE_THP_MIGRATION is set.
>
> Worst case the process would hang in handle_userfault(), but it will
> still respond fine to sigkill, so it's not concerning, but it should
> be fixed nevertheless.
>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Thanks!  I will revise the patch description and send the new version!

Best Regards,
Huang, Ying

[snip]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-11-09  7:33 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-03  7:52 [RFC -mm] mm, userfaultfd, THP: Avoid waiting when PMD under THP migration Huang, Ying
2017-11-03  7:52 ` Huang, Ying
2017-11-03 15:00 ` Zi Yan
2017-11-05  3:01   ` huang ying
2017-11-05  3:01     ` huang ying
2017-11-06 15:53     ` Zi Yan
2017-11-06 15:53       ` Zi Yan
2017-11-06 20:35       ` Andrea Arcangeli
2017-11-06 20:35         ` Andrea Arcangeli
2017-11-07  2:30         ` Zi Yan
2017-11-06 20:21     ` Andrea Arcangeli
2017-11-06 20:21       ` Andrea Arcangeli
2017-11-09  7:33       ` Huang, Ying
2017-11-09  7:33         ` Huang, Ying

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.