From: Mike Kravetz <mike.kravetz@oracle.com>
To: Punit Agrawal <punit.agrawal@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org,
	Catalin Marinas <catalin.marinas@arm.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Steve Capper <steve.capper@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
Date: Fri, 18 Aug 2017 14:29:18 -0700
Message-ID: <3de49294-f6f8-2623-1778-56a3b092f2a5@oracle.com>
In-Reply-To: <20170818145415.7588-1-punit.agrawal@arm.com>

On 08/18/2017 07:54 AM, Punit Agrawal wrote:
> When walking the page tables to resolve an address that points to a
> !p*d_present() entry, huge_pte_offset() returns inconsistent values
> depending on the level of page table (PUD or PMD).
>
> It returns NULL in the case of a PUD entry while in the case of a PMD
> entry, it returns a pointer to the page table entry.
>
> A similar inconsistency exists when handling swap entries - returns NULL
> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>
> Update huge_pte_offset() to make the behaviour consistent - return a
> pointer to the pte_t for hugepage or swap entries. Only return NULL in
> instances where we have a p*d_none() entry and the size parameter
> doesn't match the hugepage size at this level of the page table.
>
> Document the behaviour to clarify the expected behaviour of this function.
> This is to set clear semantics for architecture specific implementations
> of huge_pte_offset().
>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>
> Hi Andrew,
>
> From discussions on the arm64 implementation of huge_pte_offset()[0]
> we realised that there is benefit from returning a pte_t* in the case
> of p*d_none().
>
> The fault handling code in hugetlb_fault() can handle p*d_none()
> entries and saves an extra round trip to huge_pte_alloc(). Other
> callers of huge_pte_offset() should be ok as well.

Yes, this change would eliminate that call to huge_pte_alloc() in
hugetlb_fault().  However, huge_pte_offset() is now returning a pointer
to a p*d_none() pte in some instances where it would have previously
returned NULL.  Correct?

I went through the callers, and like you am fairly confident that they
can handle this situation.  But, returning a pointer to a p*d_none()
entry instead of NULL does change the execution path in several routines
such as copy_hugetlb_page_range, __unmap_hugepage_range,
hugetlb_change_protection, and follow_hugetlb_page.  If
huge_pte_offset() returns NULL to these routines, they do a quick
continue, exit, etc.  If they are returned a pointer, they typically
lock the page table(s) and then check for p*d_none() before continuing,
exiting, etc.  So, it appears that these routines could potentially slow
down a bit with this change (in the specific case of p*d_none).

I 'think' one could argue that the fault case is more important.  So,
the savings there would outweigh any potential slowdown in the other
routines.

IMO, this new version of the patch has more potential for issues than
the previous version.  It would be helpful if others could take a look.

One thing I am still 'thinking' about is how this patch could
potentially change behavior in huge_pmd_share.  With the patch, pmd
sharing could potentially be set up in situations (pmd_none) where it
previously would not have been set up.  I don't think this is an issue,
but any change to this concerns me.

-- 
Mike Kravetz

>
> Apologies for sending a late update but I thought if we are defining
> the semantics, it's worth getting them right.
>
> Could you please pick this version?
>
> Thanks,
> Punit
>
> [0] http://www.spinics.net/lists/linux-mm/msg133699.html
>
> v2:
>
>  mm/hugetlb.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31e207cb399b..1d54a131bdd5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  	return pte;
>  }
>  
> +/*
> + * huge_pte_offset() - Walk the page table to resolve the hugepage
> + * entry at address @addr
> + *
> + * Return: Pointer to page table or swap entry (PUD or PMD) for
> + * address @addr, or NULL if a p*d_none() entry is encountered and the
> + * size @sz doesn't match the hugepage size at this level of the page
> + * table.
> + */
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz)
>  {
> @@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  	p4d = p4d_offset(pgd, addr);
>  	if (!p4d_present(*p4d))
>  		return NULL;
> +
>  	pud = pud_offset(p4d, addr);
> -	if (!pud_present(*pud))
> +	if (sz != PUD_SIZE && pud_none(*pud))
>  		return NULL;
> -	if (pud_huge(*pud))
> +	/* hugepage or swap? */
> +	if (pud_huge(*pud) || !pud_present(*pud))
>  		return (pte_t *)pud;
> +
>  	pmd = pmd_offset(pud, addr);
> -	return (pte_t *) pmd;
> +	if (sz != PMD_SIZE && pmd_none(*pmd))
> +		return NULL;
> +	/* hugepage or swap? */
> +	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> +		return (pte_t *)pmd;
> +
> +	return NULL;
>  }
>  
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
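
The return-value change discussed in this message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: the enums, the MODEL_* sizes (2 MiB and 1 GiB, as on x86-64) and the model_*() functions are all hypothetical names invented here. The two walk functions mirror the '-' and '+' sides of the quoted diff from the PUD level down, and model_caller_path() is a simplified reading of the "quick continue" versus "lock, then check for p*d_none()" paths described in the reply.

```c
#include <assert.h>

/* Hypothetical user-space model: none of these names exist in the kernel. */
#define MODEL_PMD_SIZE (1UL << 21) /* assumed 2 MiB PMD-level hugepage */
#define MODEL_PUD_SIZE (1UL << 30) /* assumed 1 GiB PUD-level hugepage */

enum entry_state {
	ENTRY_NONE,  /* p*d_none(): nothing installed */
	ENTRY_SWAP,  /* not present and not none, e.g. a swap entry */
	ENTRY_HUGE,  /* present hugepage mapping at this level */
	ENTRY_TABLE  /* present, points to the next-level table */
};

enum walk_result { WALK_NULL, WALK_PUD, WALK_PMD };

enum caller_path { PATH_SKIP_FAST, PATH_LOCK_THEN_SKIP, PATH_LOCK_AND_PROCESS };

/* Models p*d_present(): only huge and table entries count as present. */
int entry_present(enum entry_state e)
{
	return e == ENTRY_HUGE || e == ENTRY_TABLE;
}

/* Pre-patch walk, following the '-' lines of the quoted diff. */
enum walk_result model_old_offset(enum entry_state pud, enum entry_state pmd)
{
	(void)pmd; /* the old code returned the PMD slot without inspecting it */
	if (!entry_present(pud))
		return WALK_NULL;
	if (pud == ENTRY_HUGE)
		return WALK_PUD;
	return WALK_PMD;
}

/* Post-patch walk, following the '+' lines of the quoted diff. */
enum walk_result model_new_offset(enum entry_state pud, enum entry_state pmd,
				  unsigned long sz)
{
	/* Model restriction: only the two assumed hugepage sizes are handled. */
	assert(sz == MODEL_PMD_SIZE || sz == MODEL_PUD_SIZE);

	if (sz != MODEL_PUD_SIZE && pud == ENTRY_NONE)
		return WALK_NULL;
	/* hugepage or swap (a none entry is also "not present") */
	if (pud == ENTRY_HUGE || !entry_present(pud))
		return WALK_PUD;

	if (sz != MODEL_PMD_SIZE && pmd == ENTRY_NONE)
		return WALK_NULL;
	if (pmd == ENTRY_HUGE || !entry_present(pmd))
		return WALK_PMD;

	return WALK_NULL;
}

/*
 * Simplified caller behaviour: NULL means a quick skip; a pointer means
 * the caller takes the page table lock and only then looks for a none entry.
 */
enum caller_path model_caller_path(enum walk_result r, enum entry_state pud,
				   enum entry_state pmd)
{
	enum entry_state e;

	if (r == WALK_NULL)
		return PATH_SKIP_FAST;
	e = (r == WALK_PUD) ? pud : pmd;
	return e == ENTRY_NONE ? PATH_LOCK_THEN_SKIP : PATH_LOCK_AND_PROCESS;
}
```

In this model, a none PUD entry looked up with sz equal to MODEL_PUD_SIZE yields WALK_NULL from model_old_offset() but WALK_PUD from model_new_offset(), so model_caller_path() moves from PATH_SKIP_FAST to PATH_LOCK_THEN_SKIP. That is the extra locking cost the reply raises for callers other than the fault path.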