* [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
@ 2017-08-18 14:54 ` Punit Agrawal
0 siblings, 0 replies; 42+ messages in thread
From: Punit Agrawal @ 2017-08-18 14:54 UTC (permalink / raw)
To: Andrew Morton
Cc: Punit Agrawal, linux-mm, linux-kernel, linux-arch,
Catalin Marinas, Naoya Horiguchi, Steve Capper, Will Deacon,
Kirill A . Shutemov, Michal Hocko, Mike Kravetz
When walking the page tables to resolve an address that points to
!p*d_present() entry, huge_pte_offset() returns inconsistent values
depending on the level of page table (PUD or PMD).

It returns NULL in the case of a PUD entry while in the case of a PMD
entry, it returns a pointer to the page table entry.

A similar inconsistency exists when handling swap entries - returns NULL
for a PUD entry while a pointer to the pte_t is returned for the PMD entry.

Update huge_pte_offset() to make the behaviour consistent - return a
pointer to the pte_t for hugepage or swap entries. Only return NULL in
instances where we have a p*d_none() entry and the size parameter
doesn't match the hugepage size at this level of the page table.

Document the behaviour to clarify the expected behaviour of this function.
This is to set clear semantics for architecture specific implementations
of huge_pte_offset().

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---

Hi Andrew,

From discussions on the arm64 implementation of huge_pte_offset()[0]
we realised that there is benefit from returning a pte_t* in the case
of p*d_none().

The fault handling code in hugetlb_fault() can handle p*d_none()
entries and saves an extra round trip to huge_pte_alloc(). Other
callers of huge_pte_offset() should be ok as well.

Apologies for sending a late update but I thought if we are defining
the semantics, it's worth getting them right.

Could you please pick this version?

Thanks,
Punit

[0] http://www.spinics.net/lists/linux-mm/msg133699.html

v2:

 mm/hugetlb.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31e207cb399b..1d54a131bdd5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }

+/*
+ * huge_pte_offset() - Walk the page table to resolve the hugepage
+ * entry at address @addr
+ *
+ * Return: Pointer to page table or swap entry (PUD or PMD) for
+ * address @addr, or NULL if a p*d_none() entry is encountered and the
+ * size @sz doesn't match the hugepage size at this level of the page
+ * table.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
@@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	p4d = p4d_offset(pgd, addr);
 	if (!p4d_present(*p4d))
 		return NULL;
+
 	pud = pud_offset(p4d, addr);
-	if (!pud_present(*pud))
+	if (sz != PUD_SIZE && pud_none(*pud))
 		return NULL;
-	if (pud_huge(*pud))
+	/* hugepage or swap? */
+	if (pud_huge(*pud) || !pud_present(*pud))
 		return (pte_t *)pud;
+
 	pmd = pmd_offset(pud, addr);
-	return (pte_t *) pmd;
+	if (sz != PMD_SIZE && pmd_none(*pmd))
+		return NULL;
+	/* hugepage or swap? */
+	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+		return (pte_t *)pmd;
+
+	return NULL;
 }

 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
--
2.13.2
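
The effect of the hunk above may be easier to see in a small userspace model. The following is only a sketch, not kernel code: the enum entry_state values, the walk_old()/walk_new() helpers and the MODEL_*_SIZE constants are hypothetical names introduced for illustration, and the pgd/p4d levels are left out. walk_old() follows the removed lines and walk_new() follows the added lines.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the states a PUD or PMD entry can be in. */
enum entry_state {
	ENTRY_NONE,	/* p*d_none(): nothing installed */
	ENTRY_TABLE,	/* present, points to a next-level table */
	ENTRY_HUGE,	/* present, maps a huge page (p*d_huge()) */
	ENTRY_SWAP,	/* not present, but holds a swap/migration entry */
};

/* Which pointer the modelled walk hands back to the caller. */
enum walk_result { WALK_NULL, WALK_PUD, WALK_PMD };

/* Illustrative sizes only; the real values are architecture specific. */
#define MODEL_PMD_SIZE (1UL << 21)
#define MODEL_PUD_SIZE (1UL << 30)

static bool is_none(enum entry_state e)
{
	return e == ENTRY_NONE;
}

static bool is_huge(enum entry_state e)
{
	return e == ENTRY_HUGE;
}

static bool is_present(enum entry_state e)
{
	return e == ENTRY_TABLE || e == ENTRY_HUGE;
}

/* Mirrors the lines removed by the patch. */
static enum walk_result walk_old(enum entry_state pud, enum entry_state pmd)
{
	(void)pmd;	/* the old code never inspected the PMD entry */
	if (!is_present(pud))
		return WALK_NULL;
	if (is_huge(pud))
		return WALK_PUD;
	return WALK_PMD;
}

/* Mirrors the lines added by the patch. */
static enum walk_result walk_new(enum entry_state pud, enum entry_state pmd,
				 unsigned long sz)
{
	if (sz != MODEL_PUD_SIZE && is_none(pud))
		return WALK_NULL;
	/* hugepage or swap? */
	if (is_huge(pud) || !is_present(pud))
		return WALK_PUD;

	if (sz != MODEL_PMD_SIZE && is_none(pmd))
		return WALK_NULL;
	/* hugepage or swap? */
	if (is_huge(pmd) || !is_present(pmd))
		return WALK_PMD;

	return WALK_NULL;
}
```

In this model a swap entry at the PUD level yields WALK_NULL from walk_old() but WALK_PUD from walk_new(), while a swap entry at the PMD level yields WALK_PMD from both, which is the inconsistency the commit message describes.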
* Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
2017-08-18 14:54 ` Punit Agrawal
@ 2017-08-18 21:29 ` Mike Kravetz
-1 siblings, 0 replies; 42+ messages in thread
From: Mike Kravetz @ 2017-08-18 21:29 UTC (permalink / raw)
To: Punit Agrawal, Andrew Morton
Cc: linux-mm, linux-kernel, linux-arch, Catalin Marinas,
Naoya Horiguchi, Steve Capper, Will Deacon, Kirill A . Shutemov,
Michal Hocko
On 08/18/2017 07:54 AM, Punit Agrawal wrote:
> When walking the page tables to resolve an address that points to
> !p*d_present() entry, huge_pte_offset() returns inconsistent values
> depending on the level of page table (PUD or PMD).
>
> It returns NULL in the case of a PUD entry while in the case of a PMD
> entry, it returns a pointer to the page table entry.
>
> A similar inconsistency exists when handling swap entries - returns NULL
> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>
> Update huge_pte_offset() to make the behaviour consistent - return a
> pointer to the pte_t for hugepage or swap entries. Only return NULL in
> instances where we have a p*d_none() entry and the size parameter
> doesn't match the hugepage size at this level of the page table.
>
> Document the behaviour to clarify the expected behaviour of this function.
> This is to set clear semantics for architecture specific implementations
> of huge_pte_offset().
>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>
> Hi Andrew,
>
> From discussions on the arm64 implementation of huge_pte_offset()[0]
> we realised that there is benefit from returning a pte_t* in the case
> of p*d_none().
>
> The fault handling code in hugetlb_fault() can handle p*d_none()
> entries and saves an extra round trip to huge_pte_alloc(). Other
> callers of huge_pte_offset() should be ok as well.

Yes, this change would eliminate that call to huge_pte_alloc() in
hugetlb_fault(). However, huge_pte_offset() is now returning a pointer
to a p*d_none() pte in some instances where it would have previously
returned NULL. Correct?

I went through the callers, and like you am fairly confident that they
can handle this situation. But, returning p*d_none() instead of NULL
does change the execution path in several routines such as
copy_hugetlb_page_range, __unmap_hugepage_range, hugetlb_change_protection,
and follow_hugetlb_page. If huge_pte_offset() returns NULL to these
routines, they do a quick continue, exit, etc. If they are returned
a pointer, they typically lock the page table(s) and then check for
p*d_none() before continuing, exiting, etc. So, it appears that these
routines could potentially slow down a bit with this change (in the specific
case of p*d_none()).
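
The extra work described above can be sketched with a small userspace model. This is an illustration under stated assumptions, not the kernel code: struct model_pte, struct walk_stats, the two lookup helpers and walk_range() are hypothetical names, and the page table lock is reduced to a counter. The loop is shaped like the callers listed above: skip immediately on NULL, otherwise take the lock and only then test for a none entry.

```c
#include <stddef.h>

/* Hypothetical stand-in for one huge page table entry. */
struct model_pte {
	int none;	/* non-zero models huge_pte_none() */
};

struct walk_stats {
	unsigned int locks_taken;	/* times the page table lock was needed */
	unsigned int entries_handled;	/* entries that had real work to do */
};

/* A lookup in the style of huge_pte_offset(): NULL or a pointer. */
typedef struct model_pte *(*lookup_fn)(struct model_pte *table, size_t idx);

/* Previous behaviour in the case discussed: NULL for a none entry. */
static struct model_pte *lookup_null_on_none(struct model_pte *table,
					     size_t idx)
{
	return table[idx].none ? NULL : &table[idx];
}

/* Behaviour with the patch in that case: a pointer to the none entry. */
static struct model_pte *lookup_ptr_on_none(struct model_pte *table,
					    size_t idx)
{
	return &table[idx];
}

/* Loop shaped like the callers listed above. */
static struct walk_stats walk_range(struct model_pte *table, size_t n,
				    lookup_fn lookup)
{
	struct walk_stats st = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		struct model_pte *pte = lookup(table, i);

		if (!pte)
			continue;	/* quick skip, no lock taken */

		st.locks_taken++;	/* models taking the page table lock */
		if (pte->none)
			continue;	/* models unlock and skip */

		st.entries_handled++;
	}
	return st;
}
```

For a table that alternates none and populated entries, both lookups lead to the same number of handled entries, but the pointer-returning lookup takes the modelled lock once per entry rather than once per populated entry. Whether that difference is measurable in the real routines is not something this sketch can show.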

I 'think' one could argue that the fault case is more important. So,
the savings there would outweigh any potential slowdown in the other
routines.

IMO, this new version of the patch has more potential for issues than
the previous version. It would be helpful if others could take a look.

One thing I am still 'thinking' about is how this patch could potentially
change behavior in huge_pmd_share. With the patch, pmd sharing could
potentially be set up in situations (pmd_none) where it previously would
not have been set up. I don't think this is an issue, but any change to
this concerns me.

--
Mike Kravetz
>
> Apologies for sending a late update but I thought if we are defining
> the semantics, it's worth getting them right.
>
> Could you please pick this version?
>
> Thanks,
> Punit
>
> [0] http://www.spinics.net/lists/linux-mm/msg133699.html
>
> v2:
>
> mm/hugetlb.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31e207cb399b..1d54a131bdd5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
> return pte;
> }
>
> +/*
> + * huge_pte_offset() - Walk the page table to resolve the hugepage
> + * entry at address @addr
> + *
> + * Return: Pointer to page table or swap entry (PUD or PMD) for
> + * address @addr, or NULL if a p*d_none() entry is encountered and the
> + * size @sz doesn't match the hugepage size at this level of the page
> + * table.
> + */
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz)
> {
> @@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> p4d = p4d_offset(pgd, addr);
> if (!p4d_present(*p4d))
> return NULL;
> +
> pud = pud_offset(p4d, addr);
> - if (!pud_present(*pud))
> + if (sz != PUD_SIZE && pud_none(*pud))
> return NULL;
> - if (pud_huge(*pud))
> + /* hugepage or swap? */
> + if (pud_huge(*pud) || !pud_present(*pud))
> return (pte_t *)pud;
> +
> pmd = pmd_offset(pud, addr);
> - return (pte_t *) pmd;
> + if (sz != PMD_SIZE && pmd_none(*pmd))
> + return NULL;
> + /* hugepage or swap? */
> + if (pmd_huge(*pmd) || !pmd_present(*pmd))
> + return (pte_t *)pmd;
> +
> + return NULL;
> }
>
> #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
* Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
2017-08-18 21:29 ` Mike Kravetz
@ 2017-08-21 18:07 ` Catalin Marinas
-1 siblings, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2017-08-21 18:07 UTC (permalink / raw)
To: Mike Kravetz
Cc: Punit Agrawal, Andrew Morton, linux-mm, linux-kernel, linux-arch,
Naoya Horiguchi, Steve Capper, Will Deacon, Kirill A . Shutemov,
Michal Hocko
On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote:
> On 08/18/2017 07:54 AM, Punit Agrawal wrote:
> > When walking the page tables to resolve an address that points to
> > !p*d_present() entry, huge_pte_offset() returns inconsistent values
> > depending on the level of page table (PUD or PMD).
> >
> > It returns NULL in the case of a PUD entry while in the case of a PMD
> > entry, it returns a pointer to the page table entry.
> >
> > A similar inconsistency exists when handling swap entries - returns NULL
> > for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
> >
> > Update huge_pte_offset() to make the behaviour consistent - return a
> > pointer to the pte_t for hugepage or swap entries. Only return NULL in
> > instances where we have a p*d_none() entry and the size parameter
> > doesn't match the hugepage size at this level of the page table.
> >
> > Document the behaviour to clarify the expected behaviour of this function.
> > This is to set clear semantics for architecture specific implementations
> > of huge_pte_offset().
> >
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Steve Capper <steve.capper@arm.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >
> > Hi Andrew,
> >
> > From discussions on the arm64 implementation of huge_pte_offset()[0]
> > we realised that there is benefit from returning a pte_t* in the case
> > of p*d_none().
> >
> > The fault handling code in hugetlb_fault() can handle p*d_none()
> > entries and saves an extra round trip to huge_pte_alloc(). Other
> > callers of huge_pte_offset() should be ok as well.
>
> Yes, this change would eliminate that call to huge_pte_alloc() in
> hugetlb_fault(). However, huge_pte_offset() is now returning a pointer
> to a p*d_none() pte in some instances where it would have previously
> returned NULL. Correct?

Yes (whether it was previously the right thing to return is a different
matter; that's what we are trying to clarify in the generic code so that
we can have similar semantics on arm64).

> I went through the callers, and like you am fairly confident that they
> can handle this situation. But, returning p*d_none() instead of NULL
> does change the execution path in several routines such as
> copy_hugetlb_page_range, __unmap_hugepage_range, hugetlb_change_protection,
> and follow_hugetlb_page. If huge_pte_offset() returns NULL to these
> routines, they do a quick continue, exit, etc. If they are returned
> a pointer, they typically lock the page table(s) and then check for
> p*d_none() before continuing, exiting, etc. So, it appears that these
> routines could potentially slow down a bit with this change (in the specific
> case of p*d_none).

Arguably (well, my interpretation), it should return a NULL only if the
entry is a table entry, potentially pointing to a next level (pmd). In
the pud case, this means that sz < PUD_SIZE.

If the pud is a last level huge page entry (either present or !present),
huge_pte_offset() should return the pointer to it and never NULL. If the
entry is a swap or migration one (pte_present() == false) with the
current code we don't even enter the corresponding checks in
copy_hugetlb_page_range().

I also assume that the ptl in __unmap_hugepage_range() is taken to avoid
some race when the entry is a huge page (present or not). If such a race
doesn't exist, we could as well check the huge_pte_none() outside the
locked region (which is what the current huge_pte_offset() does with
!pud_present()).

IMHO, while the current generic huge_pte_offset() avoids some code paths
in the functions you mentioned, the results are not always correct
(missing swap/migration entries or potentially racy).

--
Catalin
* Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
2017-08-21 18:07 ` Catalin Marinas
@ 2017-08-21 21:30 ` Mike Kravetz
-1 siblings, 0 replies; 42+ messages in thread
From: Mike Kravetz @ 2017-08-21 21:30 UTC (permalink / raw)
To: Catalin Marinas
Cc: Punit Agrawal, Andrew Morton, linux-mm, linux-kernel, linux-arch,
Naoya Horiguchi, Steve Capper, Will Deacon, Kirill A . Shutemov,
Michal Hocko
On 08/21/2017 11:07 AM, Catalin Marinas wrote:
> On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote:
>> On 08/18/2017 07:54 AM, Punit Agrawal wrote:
>>> When walking the page tables to resolve an address that points to
>>> !p*d_present() entry, huge_pte_offset() returns inconsistent values
>>> depending on the level of page table (PUD or PMD).
>>>
>>> It returns NULL in the case of a PUD entry while in the case of a PMD
>>> entry, it returns a pointer to the page table entry.
>>>
>>> A similar inconsistency exists when handling swap entries - returns NULL
>>> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>>>
>>> Update huge_pte_offset() to make the behaviour consistent - return a
>>> pointer to the pte_t for hugepage or swap entries. Only return NULL in
>>> instances where we have a p*d_none() entry and the size parameter
>>> doesn't match the hugepage size at this level of the page table.
>>>
>>> Document the behaviour to clarify the expected behaviour of this function.
>>> This is to set clear semantics for architecture specific implementations
>>> of huge_pte_offset().
>>>
>>> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>> Cc: Steve Capper <steve.capper@arm.com>
>>> Cc: Will Deacon <will.deacon@arm.com>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>
>>> Hi Andrew,
>>>
>>> From discussions on the arm64 implementation of huge_pte_offset()[0]
>>> we realised that there is benefit from returning a pte_t* in the case
>>> of p*d_none().
>>>
>>> The fault handling code in hugetlb_fault() can handle p*d_none()
>>> entries and saves an extra round trip to huge_pte_alloc(). Other
>>> callers of huge_pte_offset() should be ok as well.
>>
>> Yes, this change would eliminate that call to huge_pte_alloc() in
>> hugetlb_fault(). However, huge_pte_offset() is now returning a pointer
>> to a p*d_none() pte in some instances where it would have previously
>> returned NULL. Correct?
>
> Yes (whether it was previously the right thing to return is a different
> matter; that's what we are trying to clarify in the generic code so that
> we can have similar semantics on arm64).
>
>> I went through the callers, and like you am fairly confident that they
>> can handle this situation. But, returning p*d_none() instead of NULL
>> does change the execution path in several routines such as
>> copy_hugetlb_page_range, __unmap_hugepage_range, hugetlb_change_protection,
>> and follow_hugetlb_page. If huge_pte_offset() returns NULL to these
>> routines, they do a quick continue, exit, etc. If they are returned
>> a pointer, they typically lock the page table(s) and then check for
>> p*d_none() before continuing, exiting, etc. So, it appears that these
>> routines could potentially slow down a bit with this change (in the specific
>> case of p*d_none).
>
> Arguably (well, my interpretation), it should return a NULL only if the
> entry is a table entry, potentially pointing to a next level (pmd). In
> the pud case, this means that sz < PUD_SIZE.
>
> If the pud is a last level huge page entry (either present or !present),
> huge_pte_offset() should return the pointer to it and never NULL. If the
> entry is a swap or migration one (pte_present() == false) with the
> current code we don't even enter the corresponding checks in
> copy_hugetlb_page_range().
>
> I also assume that the ptl in __unmap_hugepage_range() is taken to avoid
> some race when the entry is a huge page (present or not). If such race
> doesn't exist, we could as well check the huge_pte_none() outside the
> locked region (which is what the current huge_pte_offset() does with
> !pud_present()).
>
> IMHO, while the current generic huge_pte_offset() avoids some code paths
> in the functions you mentioned, the results are not always correct
> (missing swap/migration entries or potentially racy).

Thanks Catalin,

The more I look at this code and think about it, the more I like it. As
Michal previously mentioned, changes in this area can break things in subtle
ways. That is why I was cautious and asked for more people to look at it.
My primary concerns with these changes in this area were:
- Any potential changes in behavior. I think this has been sufficiently
  explored. While there may be small differences in behavior (for the
  better), this change should not introduce any bugs/breakage.
- Other arch specific implementations are not aligned with the new
  behavior. Again, this should not cause any issues. Punit (and I) have
  looked at the arch specific implementations for issues and found none.
  In addition, since we are not changing any of the 'calling code', no
  issues should be introduced for arch specific implementations.

I like the new semantics and did not find any issues.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
--
Mike Kravetz
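
The caller-side cost raised in the quoted discussion above can be sketched
in user space. This is only an illustrative model, not kernel code:
toy_pte_t, struct toy_stats and toy_walk() are hypothetical names, a zero
value stands in for a p*d_none() entry, and taking the page table lock is
represented by a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pte_t; the value 0 models a none entry. */
typedef unsigned long toy_pte_t;

struct toy_stats {
	int lock_taken;	/* how often the (modelled) page table lock was taken */
	int processed;	/* how many entries were actually worked on */
};

/*
 * Model of a caller loop. Each slot holds what a lookup such as
 * huge_pte_offset() would have returned for one address.
 */
static void toy_walk(toy_pte_t *const *slots, size_t n, struct toy_stats *st)
{
	assert(st != NULL);

	for (size_t i = 0; i < n; i++) {
		toy_pte_t *ptep = slots[i];

		if (!ptep)
			continue;	/* NULL: quick continue, no lock taken */

		st->lock_taken++;	/* pointer: lock first ... */
		if (*ptep == 0)
			continue;	/* ... then find a none entry and skip */

		st->processed++;
	}
}
```

Feeding the same two addresses through the loop, once with NULL for the
empty slot and once with a pointer to a none entry, gives the same amount
of work done but one extra lock acquisition in the second case. That is
the possible slowdown described above, limited to the p*d_none() case.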
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
2017-08-21 21:30 ` Mike Kravetz
@ 2017-08-22 15:32 ` Punit Agrawal
-1 siblings, 0 replies; 42+ messages in thread
From: Punit Agrawal @ 2017-08-22 15:32 UTC (permalink / raw)
To: Mike Kravetz
Cc: Catalin Marinas, Andrew Morton, linux-mm, linux-kernel,
linux-arch, Naoya Horiguchi, Steve Capper, Will Deacon,
Kirill A . Shutemov, Michal Hocko
Hi Mike,
Mike Kravetz <mike.kravetz@oracle.com> writes:
> On 08/21/2017 11:07 AM, Catalin Marinas wrote:
>> On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote:
>>> On 08/18/2017 07:54 AM, Punit Agrawal wrote:
>>>> When walking the page tables to resolve an address that points to
>>>> !p*d_present() entry, huge_pte_offset() returns inconsistent values
>>>> depending on the level of page table (PUD or PMD).
>>>>
>>>> It returns NULL in the case of a PUD entry while in the case of a PMD
>>>> entry, it returns a pointer to the page table entry.
>>>>
>>>> A similar inconsistency exists when handling swap entries - returns NULL
>>>> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>>>>
>>>> Update huge_pte_offset() to make the behaviour consistent - return a
>>>> pointer to the pte_t for hugepage or swap entries. Only return NULL in
>>>> instances where we have a p*d_none() entry and the size parameter
>>>> doesn't match the hugepage size at this level of the page table.
>>>>
>>>> Document the behaviour to clarify the expected behaviour of this function.
>>>> This is to set clear semantics for architecture specific implementations
>>>> of huge_pte_offset().
>>>>
>>>> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>> Cc: Steve Capper <steve.capper@arm.com>
>>>> Cc: Will Deacon <will.deacon@arm.com>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> ---
>>>>
>>>> Hi Andrew,
>>>>
>>>> From discussions on the arm64 implementation of huge_pte_offset()[0]
>>>> we realised that there is benefit from returning a pte_t* in the case
>>>> of p*d_none().
>>>>
>>>> The fault handling code in hugetlb_fault() can handle p*d_none()
>>>> entries and saves an extra round trip to huge_pte_alloc(). Other
>>>> callers of huge_pte_offset() should be ok as well.
>>>
>>> Yes, this change would eliminate that call to huge_pte_alloc() in
>>> hugetlb_fault(). However, huge_pte_offset() is now returning a pointer
>>> to a p*d_none() pte in some instances where it would have previously
>>> returned NULL. Correct?
>>
>> Yes (whether it was previously the right thing to return is a different
>> matter; that's what we are trying to clarify in the generic code so that
>> we can have similar semantics on arm64).
>>
>>> I went through the callers, and like you am fairly confident that they
>>> can handle this situation. But, returning p*d_none() instead of NULL
>>> does change the execution path in several routines such as
>>> copy_hugetlb_page_range, __unmap_hugepage_range, hugetlb_change_protection,
>>> and follow_hugetlb_page. If huge_pte_offset() returns NULL to these
>>> routines, they do a quick continue, exit, etc. If they are returned
>>> a pointer, they typically lock the page table(s) and then check for
>>> p*d_none() before continuing, exiting, etc. So, it appears that these
>>> routines could potentially slow down a bit with this change (in the specific
>>> case of p*d_none).
>>
>> Arguably (well, my interpretation), it should return a NULL only if the
>> entry is a table entry, potentially pointing to a next level (pmd). In
>> the pud case, this means that sz < PUD_SIZE.
>>
>> If the pud is a last level huge page entry (either present or !present),
>> huge_pte_offset() should return the pointer to it and never NULL. If the
>> entry is a swap or migration one (pte_present() == false) with the
>> current code we don't even enter the corresponding checks in
>> copy_hugetlb_page_range().
>>
>> I also assume that the ptl in __unmap_hugepage_range() is taken to avoid
>> some race when the entry is a huge page (present or not). If such race
>> doesn't exist, we could as well check the huge_pte_none() outside the
>> locked region (which is what the current huge_pte_offset() does with
>> !pud_present()).
>>
>> IMHO, while the current generic huge_pte_offset() avoids some code paths
>> in the functions you mentioned, the results are not always correct
>> (missing swap/migration entries or potentially racy).
>
> Thanks Catalin,
>
> The more I look at this code and think about it, the more I like it. As
> Michal previously mentioned, changes in this area can break things in subtle
> ways. That is why I was cautious and asked for more people to look at it.
> My primary concerns with these changes in this area were:
> - Any potential changes in behavior. I think this has been sufficiently
> explored. While there may be small differences in behavior (for the
> better), this change should not introduce any bugs/breakage.
> - Other arch specific implementations are not aligned with the new
> behavior. Again, this should not cause any issues. Punit (and I) have
> looked at the arch specific implementations for issues and found none.
> In addition, since we are not changing any of the 'calling code', no
> issues should be introduced for arch specific implementations.
>
> I like the new semantics and did not find any issues.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks for reviewing the updated semantics against existing usage. I'll
monitor the lists for any reported breakage but please do shout out if
you notice any issues.

Thanks,
Punit
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
2017-08-18 14:54 ` Punit Agrawal
@ 2017-08-22 10:11 ` Catalin Marinas
-1 siblings, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2017-08-22 10:11 UTC (permalink / raw)
To: Punit Agrawal
Cc: Andrew Morton, linux-mm, linux-kernel, linux-arch,
Naoya Horiguchi, Steve Capper, Will Deacon, Kirill A . Shutemov,
Michal Hocko, Mike Kravetz
On Fri, Aug 18, 2017 at 03:54:15PM +0100, Punit Agrawal wrote:
> When walking the page tables to resolve an address that points to
> !p*d_present() entry, huge_pte_offset() returns inconsistent values
> depending on the level of page table (PUD or PMD).
>
> It returns NULL in the case of a PUD entry while in the case of a PMD
> entry, it returns a pointer to the page table entry.
>
> A similar inconsistency exists when handling swap entries - returns NULL
> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>
> Update huge_pte_offset() to make the behaviour consistent - return a
> pointer to the pte_t for hugepage or swap entries. Only return NULL in
> instances where we have a p*d_none() entry and the size parameter
> doesn't match the hugepage size at this level of the page table.
>
> Document the behaviour to clarify the expected behaviour of this function.
> This is to set clear semantics for architecture specific implementations
> of huge_pte_offset().
>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
FWIW:
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Thanks.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
2017-08-18 14:54 ` Punit Agrawal
@ 2017-08-30 7:49 ` Michal Hocko
-1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-08-30 7:49 UTC (permalink / raw)
To: Punit Agrawal
Cc: Andrew Morton, linux-mm, linux-kernel, linux-arch,
Catalin Marinas, Naoya Horiguchi, Steve Capper, Will Deacon,
Kirill A . Shutemov, Mike Kravetz
On Fri 18-08-17 15:54:15, Punit Agrawal wrote:
> When walking the page tables to resolve an address that points to
> !p*d_present() entry, huge_pte_offset() returns inconsistent values
> depending on the level of page table (PUD or PMD).
>
> It returns NULL in the case of a PUD entry while in the case of a PMD
> entry, it returns a pointer to the page table entry.
>
> A similar inconsistency exists when handling swap entries - returns NULL
> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>
> Update huge_pte_offset() to make the behaviour consistent - return a
> pointer to the pte_t for hugepage or swap entries. Only return NULL in
> instances where we have a p*d_none() entry and the size parameter
> doesn't match the hugepage size at this level of the page table.
>
> Document the behaviour to clarify the expected behaviour of this function.
> This is to set clear semantics for architecture specific implementations
> of huge_pte_offset().
>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>

I always thought that the weird semantic is a result of the hugetlb pte
sharing. But now that I dug into history it has been added by
02b0ccef903e ("[PATCH] hugetlb: check p?d_present in huge_pte_offset()")
for a completely different reason. I suspect the weird semantic just
wasn't noticed back then.

Anyway, I didn't find any problem with the patch.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>
> Hi Andrew,
>
> From discussions on the arm64 implementation of huge_pte_offset()[0]
> we realised that there is benefit from returning a pte_t* in the case
> of p*d_none().
>
> The fault handling code in hugetlb_fault() can handle p*d_none()
> entries and saves an extra round trip to huge_pte_alloc(). Other
> callers of huge_pte_offset() should be ok as well.
>
> Apologies for sending a late update but I thought if we are defining
> the semantics, it's worth getting them right.
>
> Could you please pick this version?
>
> Thanks,
> Punit
>
> [0] http://www.spinics.net/lists/linux-mm/msg133699.html
>
> v2:
>
> mm/hugetlb.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31e207cb399b..1d54a131bdd5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  	return pte;
>  }
>
> +/*
> + * huge_pte_offset() - Walk the page table to resolve the hugepage
> + * entry at address @addr
> + *
> + * Return: Pointer to page table or swap entry (PUD or PMD) for
> + * address @addr, or NULL if a p*d_none() entry is encountered and the
> + * size @sz doesn't match the hugepage size at this level of the page
> + * table.
> + */
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz)
>  {
> @@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  	p4d = p4d_offset(pgd, addr);
>  	if (!p4d_present(*p4d))
>  		return NULL;
> +
>  	pud = pud_offset(p4d, addr);
> -	if (!pud_present(*pud))
> +	if (sz != PUD_SIZE && pud_none(*pud))
>  		return NULL;
> -	if (pud_huge(*pud))
> +	/* hugepage or swap? */
> +	if (pud_huge(*pud) || !pud_present(*pud))
>  		return (pte_t *)pud;
> +
>  	pmd = pmd_offset(pud, addr);
> -	return (pte_t *) pmd;
> +	if (sz != PMD_SIZE && pmd_none(*pmd))
> +		return NULL;
> +	/* hugepage or swap? */
> +	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> +		return (pte_t *)pmd;
> +
> +	return NULL;
>  }
>
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> --
> 2.13.2
>
--
Michal Hocko
SUSE Labs
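
To compare the removed and added lines of the quoted diff side by side,
the PUD/PMD part of the walk can be modelled in user space. This is a
rough sketch under stated assumptions, not the kernel implementation:
enum toy_state, enum toy_result, toy_present(), old_lookup() and
new_lookup() are hypothetical names, TOY_PUD_SIZE and TOY_PMD_SIZE are
assumed stand-in values for PUD_SIZE and PMD_SIZE, and each page table
entry is reduced to one of four states.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins for PMD_SIZE and PUD_SIZE; real values are per-arch. */
#define TOY_PMD_SIZE (2UL << 20)
#define TOY_PUD_SIZE (1UL << 30)

enum toy_state {
	TOY_NONE,	/* p*d_none(): empty entry */
	TOY_TABLE,	/* present, points to the next-level table */
	TOY_HUGE,	/* present hugepage mapping at this level */
	TOY_SWAP,	/* non-present swap or migration entry */
};

enum toy_result { RES_NULL, RES_PUD, RES_PMD };

/* Models p*d_present(): true for table and hugepage entries only. */
static int toy_present(enum toy_state s)
{
	return s == TOY_TABLE || s == TOY_HUGE;
}

/* Mirrors the lines the diff removes. */
static enum toy_result old_lookup(enum toy_state pud, enum toy_state pmd)
{
	(void)pmd;	/* the old code never inspected the PMD entry */
	if (!toy_present(pud))
		return RES_NULL;
	if (pud == TOY_HUGE)
		return RES_PUD;
	return RES_PMD;
}

/* Mirrors the lines the diff adds. */
static enum toy_result new_lookup(enum toy_state pud, enum toy_state pmd,
				  unsigned long sz)
{
	assert(sz == TOY_PUD_SIZE || sz == TOY_PMD_SIZE);

	if (sz != TOY_PUD_SIZE && pud == TOY_NONE)
		return RES_NULL;
	if (pud == TOY_HUGE || !toy_present(pud))	/* hugepage or swap? */
		return RES_PUD;

	if (sz != TOY_PMD_SIZE && pmd == TOY_NONE)
		return RES_NULL;
	if (pmd == TOY_HUGE || !toy_present(pmd))	/* hugepage or swap? */
		return RES_PMD;

	return RES_NULL;
}
```

In this model old_lookup(TOY_SWAP, TOY_NONE) yields RES_NULL, whereas
new_lookup(TOY_SWAP, TOY_NONE, TOY_PUD_SIZE) yields RES_PUD, which is the
PUD-level swap case the patch description calls out. A swap entry at the
PMD level yields RES_PMD in both versions.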
^ permalink raw reply [flat|nested] 42+ messages in thread