From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id BFD636B02C3 for ; Tue, 25 Jul 2017 11:41:29 -0400 (EDT)
Received: by mail-pg0-f71.google.com with SMTP id c14so186551168pgn.11 for ; Tue, 25 Jul 2017 08:41:29 -0700 (PDT)
Received: from foss.arm.com (foss.arm.com. [217.140.101.70]) by mx.google.com with ESMTP id p2si8817860pli.415.2017.07.25.08.41.28 for ; Tue, 25 Jul 2017 08:41:28 -0700 (PDT)
From: Punit Agrawal
Subject: [PATCH 0/1] Clarify huge_pte_offset() semantics
Date: Tue, 25 Jul 2017 16:41:13 +0100
Message-Id: <20170725154114.24131-1-punit.agrawal@arm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Punit Agrawal, Naoya Horiguchi, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Michal Hocko, Mike Kravetz

Hi,

The following patch is an attempt to make huge_pte_offset() consistent when dealing with different levels of the page table and document the expected semantics. The previous posting can be found at [0].

Changelog

RFC - v1
* Merge Patch 1 and 2 - preserve bisectability
* Drop RFC tag

Original cover letter follows...

The generic implementation of huge_pte_offset() has inconsistent behaviour when looking up hugepage PUD vs PMD entries that are not present (returning NULL vs pte_t*). Similarly, it returns NULL when encountering swap entries although all the callers have special checks to properly deal with swap entries.

Without clear semantics, it is difficult to determine if a change breaks huge_pte_offset() without going through all the scenarios where it is used. I faced this recently when updating the arm64 implementation of huge_pte_offset() to handle swap entries (related to enabling poisoned memory)[1].
I will face it again when I update that implementation for contiguous hugepage support, now that the core changes have been merged.

To address these issues, the following patch -

* makes huge_pte_offset() consistent between PUDs and PMDs
* and, documents the expected behaviour of huge_pte_offset()

All feedback welcome.

Thanks,
Punit

[0] https://lkml.org/lkml/2017/7/24/514
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f02ab08afbe76ee7b0b2a34a9970e7dd200d8b01

Punit Agrawal (1):
  mm/hugetlb: Make huge_pte_offset() consistent and document behaviour

 mm/hugetlb.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

--
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 9D88F6B02F4 for ; Tue, 25 Jul 2017 11:41:36 -0400 (EDT)
Received: by mail-pg0-f69.google.com with SMTP id b8so47484864pgn.10 for ; Tue, 25 Jul 2017 08:41:36 -0700 (PDT)
Received: from foss.arm.com (foss.arm.com.
[217.140.101.70]) by mx.google.com with ESMTP id b5si437857pgc.359.2017.07.25.08.41.35 for ; Tue, 25 Jul 2017 08:41:35 -0700 (PDT)
From: Punit Agrawal
Subject: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour
Date: Tue, 25 Jul 2017 16:41:14 +0100
Message-Id: <20170725154114.24131-2-punit.agrawal@arm.com>
In-Reply-To: <20170725154114.24131-1-punit.agrawal@arm.com>
References: <20170725154114.24131-1-punit.agrawal@arm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Punit Agrawal, Naoya Horiguchi, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Michal Hocko, Mike Kravetz

When walking the page tables to resolve an address that points to a !p*d_present() entry, huge_pte_offset() returns inconsistent values depending on the level of page table (PUD or PMD).

It returns NULL in the case of a PUD entry while in the case of a PMD entry, it returns a pointer to the page table entry.

A similar inconsistency exists when handling swap entries - it returns NULL for a PUD entry while a pointer to the pte_t is returned for the PMD entry.

Update huge_pte_offset() to make the behaviour consistent - return NULL in the case of p*d_none() and a pointer to the pte_t for hugepage or swap entries.

Document the behaviour to clarify the expected behaviour of this function. This is to set clear semantics for architecture specific implementations of huge_pte_offset().
Signed-off-by: Punit Agrawal
---
 mm/hugetlb.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc48ee783dd9..72dd1139a8e4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4603,6 +4603,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	return pte;
 }
 
+/*
+ * huge_pte_offset() - Walk the page table to resolve the hugepage
+ * entry at address @addr
+ *
+ * Return: Pointer to page table or swap entry (PUD or PMD) for address @addr
+ * or NULL if the entry is p*d_none().
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
@@ -4617,13 +4624,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	p4d = p4d_offset(pgd, addr);
 	if (!p4d_present(*p4d))
 		return NULL;
+
 	pud = pud_offset(p4d, addr);
-	if (!pud_present(*pud))
+	if (pud_none(*pud))
 		return NULL;
-	if (pud_huge(*pud))
+	/* hugepage or swap? */
+	if (pud_huge(*pud) || !pud_present(*pud))
 		return (pte_t *)pud;
+
 	pmd = pmd_offset(pud, addr);
-	return (pte_t *) pmd;
+	if (pmd_none(*pmd))
+		return NULL;
+	/* hugepage or swap? */
+	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+		return (pte_t *) pmd;
+
+	return NULL;
 }
 
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
--
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id AF8D06B02F4 for ; Wed, 26 Jul 2017 04:39:53 -0400 (EDT)
Received: by mail-pg0-f72.google.com with SMTP id u199so111479084pgb.13 for ; Wed, 26 Jul 2017 01:39:53 -0700 (PDT)
Received: from foss.arm.com (foss.arm.com.
[217.140.101.70]) by mx.google.com with ESMTP id e9si9284577pgn.812.2017.07.26.01.39.52 for ; Wed, 26 Jul 2017 01:39:52 -0700 (PDT) Date: Wed, 26 Jul 2017 09:39:46 +0100 From: Catalin Marinas Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour Message-ID: <20170726083945.4ejqwnxomplrqxrf@armageddon.cambridge.arm.com> References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170725154114.24131-2-punit.agrawal@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, kirill.shutemov@linux.intel.com, Michal Hocko , Mike Kravetz On Tue, Jul 25, 2017 at 04:41:14PM +0100, Punit Agrawal wrote: > When walking the page tables to resolve an address that points to > !p*d_present() entry, huge_pte_offset() returns inconsistent values > depending on the level of page table (PUD or PMD). > > It returns NULL in the case of a PUD entry while in the case of a PMD > entry, it returns a pointer to the page table entry. > > A similar inconsitency exists when handling swap entries - returns NULL > for a PUD entry while a pointer to the pte_t is retured for the PMD > entry. > > Update huge_pte_offset() to make the behaviour consistent - return NULL > in the case of p*d_none() and a pointer to the pte_t for hugepage or > swap entries. > > Document the behaviour to clarify the expected behaviour of this > function. This is to set clear semantics for architecture specific > implementations of huge_pte_offset(). > > Signed-off-by: Punit Agrawal Reviewed-by: Catalin Marinas -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id 05F996B02F4 for ; Wed, 26 Jul 2017 04:50:43 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id h126so7552694wmf.10 for ; Wed, 26 Jul 2017 01:50:42 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id q23si16801882wrc.56.2017.07.26.01.50.41 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 26 Jul 2017 01:50:41 -0700 (PDT) Date: Wed, 26 Jul 2017 10:50:38 +0200 From: Michal Hocko Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour Message-ID: <20170726085038.GB2981@dhcp22.suse.cz> References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170725154114.24131-2-punit.agrawal@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Mike Kravetz On Tue 25-07-17 16:41:14, Punit Agrawal wrote: > When walking the page tables to resolve an address that points to > !p*d_present() entry, huge_pte_offset() returns inconsistent values > depending on the level of page table (PUD or PMD). > > It returns NULL in the case of a PUD entry while in the case of a PMD > entry, it returns a pointer to the page table entry. > > A similar inconsitency exists when handling swap entries - returns NULL > for a PUD entry while a pointer to the pte_t is retured for the PMD > entry. 
> > Update huge_pte_offset() to make the behaviour consistent - return NULL > in the case of p*d_none() and a pointer to the pte_t for hugepage or > swap entries. > > Document the behaviour to clarify the expected behaviour of this > function. This is to set clear semantics for architecture specific > implementations of huge_pte_offset(). hugetlb pte semantic is a disaster and I agree it could see some cleanup/clarifications but I am quite nervous to see a patchi like this. How do we check that nothing will get silently broken by this change? > Signed-off-by: Punit Agrawal > --- > mm/hugetlb.c | 22 +++++++++++++++++++--- > 1 file changed, 19 insertions(+), 3 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index bc48ee783dd9..72dd1139a8e4 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4603,6 +4603,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, > return pte; > } > > +/* > + * huge_pte_offset() - Walk the page table to resolve the hugepage > + * entry at address @addr > + * > + * Return: Pointer to page table or swap entry (PUD or PMD) for address @addr > + * or NULL if the entry is p*d_none(). > + */ > pte_t *huge_pte_offset(struct mm_struct *mm, > unsigned long addr, unsigned long sz) > { > @@ -4617,13 +4624,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm, > p4d = p4d_offset(pgd, addr); > if (!p4d_present(*p4d)) > return NULL; > + > pud = pud_offset(p4d, addr); > - if (!pud_present(*pud)) > + if (pud_none(*pud)) > return NULL; > - if (pud_huge(*pud)) > + /* hugepage or swap? */ > + if (pud_huge(*pud) || !pud_present(*pud)) > return (pte_t *)pud; > + > pmd = pmd_offset(pud, addr); > - return (pte_t *) pmd; > + if (pmd_none(*pmd)) > + return NULL; > + /* hugepage or swap? 
*/ > + if (pmd_huge(*pmd) || !pmd_present(*pmd)) > + return (pte_t *) pmd; > + > + return NULL; > } > > #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ > -- > 2.11.0 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id CA6DF6B02F4 for ; Wed, 26 Jul 2017 04:53:30 -0400 (EDT) Received: by mail-wr0-f200.google.com with SMTP id v102so31074325wrb.2 for ; Wed, 26 Jul 2017 01:53:30 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id j21si17585965wra.51.2017.07.26.01.53.29 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 26 Jul 2017 01:53:29 -0700 (PDT) Date: Wed, 26 Jul 2017 10:53:25 +0200 From: Michal Hocko Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour Message-ID: <20170726085325.GC2981@dhcp22.suse.cz> References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170726085038.GB2981@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Mike Kravetz On Wed 26-07-17 10:50:38, Michal Hocko wrote: > On Tue 25-07-17 16:41:14, Punit Agrawal wrote: > > When walking the page tables to resolve an address that points to > > !p*d_present() entry, huge_pte_offset() returns inconsistent values > > depending on the level of page 
table (PUD or PMD). > > > > It returns NULL in the case of a PUD entry while in the case of a PMD > > entry, it returns a pointer to the page table entry. > > > > A similar inconsitency exists when handling swap entries - returns NULL > > for a PUD entry while a pointer to the pte_t is retured for the PMD > > entry. > > > > Update huge_pte_offset() to make the behaviour consistent - return NULL > > in the case of p*d_none() and a pointer to the pte_t for hugepage or > > swap entries. > > > > Document the behaviour to clarify the expected behaviour of this > > function. This is to set clear semantics for architecture specific > > implementations of huge_pte_offset(). > > hugetlb pte semantic is a disaster and I agree it could see some > cleanup/clarifications but I am quite nervous to see a patchi like this. > How do we check that nothing will get silently broken by this change? Forgot to add. Hugetlb have been special because of the pte sharing. I haven't looked into that code for quite some time but there might be a good reason why pud behave differently. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 8371A6B0497 for ; Wed, 26 Jul 2017 08:11:50 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id j79so183102972pfj.9 for ; Wed, 26 Jul 2017 05:11:50 -0700 (PDT) Received: from foss.arm.com (usa-sjc-mx-foss1.foss.arm.com. 
[217.140.101.70]) by mx.google.com with ESMTP id f7si10452677pln.139.2017.07.26.05.11.49 for ; Wed, 26 Jul 2017 05:11:49 -0700 (PDT) From: Punit Agrawal Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> <20170726085325.GC2981@dhcp22.suse.cz> Date: Wed, 26 Jul 2017 13:11:46 +0100 In-Reply-To: <20170726085325.GC2981@dhcp22.suse.cz> (Michal Hocko's message of "Wed, 26 Jul 2017 10:53:25 +0200") Message-ID: <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Mike Kravetz Hi Michal, Michal Hocko writes: > On Wed 26-07-17 10:50:38, Michal Hocko wrote: >> On Tue 25-07-17 16:41:14, Punit Agrawal wrote: >> > When walking the page tables to resolve an address that points to >> > !p*d_present() entry, huge_pte_offset() returns inconsistent values >> > depending on the level of page table (PUD or PMD). >> > >> > It returns NULL in the case of a PUD entry while in the case of a PMD >> > entry, it returns a pointer to the page table entry. >> > >> > A similar inconsitency exists when handling swap entries - returns NULL >> > for a PUD entry while a pointer to the pte_t is retured for the PMD >> > entry. >> > >> > Update huge_pte_offset() to make the behaviour consistent - return NULL >> > in the case of p*d_none() and a pointer to the pte_t for hugepage or >> > swap entries. >> > >> > Document the behaviour to clarify the expected behaviour of this >> > function. This is to set clear semantics for architecture specific >> > implementations of huge_pte_offset(). 
>> >> hugetlb pte semantic is a disaster and I agree it could see some >> cleanup/clarifications but I am quite nervous to see a patchi like this. >> How do we check that nothing will get silently broken by this change? Glad I'm not the only one who finds the hugetlb semantics somewhat confusing. :) I've been running tests from mce-test suite and libhugetlbfs for similar changes we did on arm64. There could be assumptions that were not exercised but I'm not sure how to check for all the possible usages. Do you have any other suggestions that can help improve confidence in the patch? > > Forgot to add. Hugetlb have been special because of the pte sharing. I > haven't looked into that code for quite some time but there might be a > good reason why pud behave differently. I checked the code and don't see anything that would explain (or require) the difference in behaviour. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id 113646B0313 for ; Wed, 26 Jul 2017 08:34:02 -0400 (EDT) Received: by mail-wm0-f70.google.com with SMTP id e204so7343272wma.2 for ; Wed, 26 Jul 2017 05:34:02 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id 2si13463901wrh.288.2017.07.26.05.34.00 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 26 Jul 2017 05:34:00 -0700 (PDT) Date: Wed, 26 Jul 2017 14:33:57 +0200 From: Michal Hocko Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour Message-ID: <20170726123357.GP2981@dhcp22.suse.cz> References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> <20170726085325.GC2981@dhcp22.suse.cz> <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Mike Kravetz On Wed 26-07-17 13:11:46, Punit Agrawal wrote: > Hi Michal, > > Michal Hocko writes: > > > On Wed 26-07-17 10:50:38, Michal Hocko wrote: > >> On Tue 25-07-17 16:41:14, Punit Agrawal wrote: > >> > When walking the page tables to resolve an address that points to > >> > !p*d_present() entry, huge_pte_offset() returns inconsistent values > >> > depending on the level of page table (PUD or PMD). > >> > > >> > It returns NULL in the case of a PUD entry while in the case of a PMD > >> > entry, it returns a pointer to the page table entry. > >> > > >> > A similar inconsitency exists when handling swap entries - returns NULL > >> > for a PUD entry while a pointer to the pte_t is retured for the PMD > >> > entry. > >> > > >> > Update huge_pte_offset() to make the behaviour consistent - return NULL > >> > in the case of p*d_none() and a pointer to the pte_t for hugepage or > >> > swap entries. 
> >> > > >> > Document the behaviour to clarify the expected behaviour of this > >> > function. This is to set clear semantics for architecture specific > >> > implementations of huge_pte_offset(). > >> > >> hugetlb pte semantic is a disaster and I agree it could see some > >> cleanup/clarifications but I am quite nervous to see a patchi like this. > >> How do we check that nothing will get silently broken by this change? > > Glad I'm not the only one who finds the hugetlb semantics somewhat > confusing. :) This is a huge understatement. It is a source of nightmares. > I've been running tests from mce-test suite and libhugetlbfs for similar > changes we did on arm64. There could be assumptions that were not > exercised but I'm not sure how to check for all the possible usages. > > Do you have any other suggestions that can help improve confidence in > the patch? Unfortunatelly I don't. I just know there were many subtle assumptions all over the place so I am rather careful to not touch the code unless really necessary. That being said, I am not opposing your patch. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id 932806B02B4 for ; Wed, 26 Jul 2017 08:47:08 -0400 (EDT) Received: by mail-wr0-f199.google.com with SMTP id u89so31716076wrc.1 for ; Wed, 26 Jul 2017 05:47:08 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id 200si8843626wmf.64.2017.07.26.05.47.06 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 26 Jul 2017 05:47:07 -0700 (PDT) Date: Wed, 26 Jul 2017 14:47:04 +0200 From: Michal Hocko Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour Message-ID: <20170726124704.GQ2981@dhcp22.suse.cz> References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> <20170726085325.GC2981@dhcp22.suse.cz> <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> <20170726123357.GP2981@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170726123357.GP2981@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Mike Kravetz On Wed 26-07-17 14:33:57, Michal Hocko wrote: > On Wed 26-07-17 13:11:46, Punit Agrawal wrote: [...] > > I've been running tests from mce-test suite and libhugetlbfs for similar > > changes we did on arm64. There could be assumptions that were not > > exercised but I'm not sure how to check for all the possible usages. > > > > Do you have any other suggestions that can help improve confidence in > > the patch? > > Unfortunatelly I don't. I just know there were many subtle assumptions > all over the place so I am rather careful to not touch the code unless > really necessary. > > That being said, I am not opposing your patch. Let me be more specific. I am not opposing your patch but we should definitely need more reviewers to have a look. I am not seeing any immediate problems with it but I do not see a large improvements either (slightly less nightmare doesn't make me sleep all that well ;)). 
So I will leave the decisions to others. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id DE9516B0387 for ; Wed, 26 Jul 2017 09:34:31 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id k190so215151404pgk.8 for ; Wed, 26 Jul 2017 06:34:31 -0700 (PDT) Received: from foss.arm.com (usa-sjc-mx-foss1.foss.arm.com. [217.140.101.70]) by mx.google.com with ESMTP id a8si10232318ple.118.2017.07.26.06.34.30 for ; Wed, 26 Jul 2017 06:34:31 -0700 (PDT) From: Punit Agrawal Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> <20170726085325.GC2981@dhcp22.suse.cz> <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> <20170726123357.GP2981@dhcp22.suse.cz> <20170726124704.GQ2981@dhcp22.suse.cz> Date: Wed, 26 Jul 2017 14:34:27 +0100 In-Reply-To: <20170726124704.GQ2981@dhcp22.suse.cz> (Michal Hocko's message of "Wed, 26 Jul 2017 14:47:04 +0200") Message-ID: <8760efjp98.fsf@e105922-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com, Mike Kravetz Michal Hocko writes: > On Wed 26-07-17 14:33:57, Michal Hocko wrote: >> On Wed 26-07-17 13:11:46, Punit Agrawal wrote: > [...] >> > I've been running tests from mce-test suite and libhugetlbfs for similar >> > changes we did on arm64. 
There could be assumptions that were not >> > exercised but I'm not sure how to check for all the possible usages. >> > >> > Do you have any other suggestions that can help improve confidence in >> > the patch? >> >> Unfortunatelly I don't. I just know there were many subtle assumptions >> all over the place so I am rather careful to not touch the code unless >> really necessary. >> >> That being said, I am not opposing your patch. > > Let me be more specific. I am not opposing your patch but we should > definitely need more reviewers to have a look. I am not seeing any > immediate problems with it but I do not see a large improvements either > (slightly less nightmare doesn't make me sleep all that well ;)). So I > will leave the decisions to others. I hear you - I'd definitely appreciate more eyes on the code change and description. Thanks for taking a look. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f200.google.com (mail-qt0-f200.google.com [209.85.216.200]) by kanga.kvack.org (Postfix) with ESMTP id DB2256B02B4 for ; Wed, 26 Jul 2017 23:16:45 -0400 (EDT) Received: by mail-qt0-f200.google.com with SMTP id i19so60800608qte.5 for ; Wed, 26 Jul 2017 20:16:45 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. 
[156.151.31.81]) by mx.google.com with ESMTPS id 136si14391649qkg.70.2017.07.26.20.16.44 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 26 Jul 2017 20:16:45 -0700 (PDT) Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> <20170726085325.GC2981@dhcp22.suse.cz> <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> <20170726123357.GP2981@dhcp22.suse.cz> <20170726124704.GQ2981@dhcp22.suse.cz> <8760efjp98.fsf@e105922-lin.cambridge.arm.com> From: Mike Kravetz Message-ID: <9b3b3585-f984-e592-122c-ed23c8558069@oracle.com> Date: Wed, 26 Jul 2017 20:16:31 -0700 MIME-Version: 1.0 In-Reply-To: <8760efjp98.fsf@e105922-lin.cambridge.arm.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal , Michal Hocko Cc: Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com On 07/26/2017 06:34 AM, Punit Agrawal wrote: > Michal Hocko writes: > >> On Wed 26-07-17 14:33:57, Michal Hocko wrote: >>> On Wed 26-07-17 13:11:46, Punit Agrawal wrote: >> [...] >>>> I've been running tests from mce-test suite and libhugetlbfs for similar >>>> changes we did on arm64. There could be assumptions that were not >>>> exercised but I'm not sure how to check for all the possible usages. >>>> >>>> Do you have any other suggestions that can help improve confidence in >>>> the patch? >>> >>> Unfortunatelly I don't. I just know there were many subtle assumptions >>> all over the place so I am rather careful to not touch the code unless >>> really necessary. >>> >>> That being said, I am not opposing your patch. >> >> Let me be more specific. 
I am not opposing your patch but we should >> definitely need more reviewers to have a look. I am not seeing any >> immediate problems with it but I do not see a large improvements either >> (slightly less nightmare doesn't make me sleep all that well ;)). So I >> will leave the decisions to others. > > I hear you - I'd definitely appreciate more eyes on the code change and > description. I like the change in semantics for the routine. Like you, I examined all callers of huge_pte_offset() and it appears that they will not be impacted by your change. My only concern is that arch specific versions of huge_pte_offset, may not (yet) follow the new semantic. Someone could potentially introduce a new huge_pte_offset call and depend on the new 'documented' semantics. Yet, an unmodified arch specific version of huge_pte_offset might have different semantics. I have not reviewed all the arch specific instances of the routine to know if this is even possible. Just curious if you examined these, or perhaps you think this is not an issue? -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 8AAAA6B0292 for ; Thu, 27 Jul 2017 08:58:29 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id c87so37500638pfd.14 for ; Thu, 27 Jul 2017 05:58:29 -0700 (PDT) Received: from foss.arm.com (foss.arm.com. 
[217.140.101.70]) by mx.google.com with ESMTP id d127si10894358pfg.316.2017.07.27.05.58.28 for ; Thu, 27 Jul 2017 05:58:28 -0700 (PDT) From: Punit Agrawal Subject: Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-1-punit.agrawal@arm.com> <20170725154114.24131-2-punit.agrawal@arm.com> <20170726085038.GB2981@dhcp22.suse.cz> <20170726085325.GC2981@dhcp22.suse.cz> <87bmo7jt31.fsf@e105922-lin.cambridge.arm.com> <20170726123357.GP2981@dhcp22.suse.cz> <20170726124704.GQ2981@dhcp22.suse.cz> <8760efjp98.fsf@e105922-lin.cambridge.arm.com> <9b3b3585-f984-e592-122c-ed23c8558069@oracle.com> Date: Thu, 27 Jul 2017 13:58:25 +0100 In-Reply-To: <9b3b3585-f984-e592-122c-ed23c8558069@oracle.com> (Mike Kravetz's message of "Wed, 26 Jul 2017 20:16:31 -0700") Message-ID: <87o9s6hw9a.fsf@e105922-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Michal Hocko , Andrew Morton , Naoya Horiguchi , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steve.capper@arm.com, will.deacon@arm.com, catalin.marinas@arm.com, kirill.shutemov@linux.intel.com Mike Kravetz writes: > On 07/26/2017 06:34 AM, Punit Agrawal wrote: >> Michal Hocko writes: >> >>> On Wed 26-07-17 14:33:57, Michal Hocko wrote: >>>> On Wed 26-07-17 13:11:46, Punit Agrawal wrote: >>> [...] >>>>> I've been running tests from mce-test suite and libhugetlbfs for similar >>>>> changes we did on arm64. There could be assumptions that were not >>>>> exercised but I'm not sure how to check for all the possible usages. >>>>> >>>>> Do you have any other suggestions that can help improve confidence in >>>>> the patch? >>>> >>>> Unfortunatelly I don't. I just know there were many subtle assumptions >>>> all over the place so I am rather careful to not touch the code unless >>>> really necessary. >>>> >>>> That being said, I am not opposing your patch. 
>>> >>> Let me be more specific. I am not opposing your patch but we should >>> definitely need more reviewers to have a look. I am not seeing any >>> immediate problems with it but I do not see a large improvements either >>> (slightly less nightmare doesn't make me sleep all that well ;)). So I >>> will leave the decisions to others. >> >> I hear you - I'd definitely appreciate more eyes on the code change and >> description. > > I like the change in semantics for the routine. Like you, I examined all > callers of huge_pte_offset() and it appears that they will not be impacted > by your change. > > My only concern is that arch specific versions of huge_pte_offset, may > not (yet) follow the new semantic. Someone could potentially introduce > a new huge_pte_offset call and depend on the new 'documented' semantics. > Yet, an unmodified arch specific version of huge_pte_offset might have > different semantics. I have not reviewed all the arch specific instances > of the routine to know if this is even possible. Just curious if you > examined these, or perhaps you think this is not an issue? From checking through the architecture-specific implementations of huge_pte_offset(), the change shouldn't break anything. (I also cc'd the posting to linux-arch so that architecture maintainers take more notice.) This is because existing users actively deal with the different returned values (NULL, huge pte_t*, swap pte_t*) and are not checking explicitly for pmd or pud. Guarding against future users is trickier - it would definitely help to align all the implementations. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id DE6E56B02C3 for ; Fri, 18 Aug 2017 10:54:43 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id y129so176050165pgy.1 for ; Fri, 18 Aug 2017 07:54:43 -0700 (PDT) Received: from foss.arm.com (foss.arm.com. [217.140.101.70]) by mx.google.com with ESMTP id k33si2586649pld.372.2017.08.18.07.54.42 for ; Fri, 18 Aug 2017 07:54:42 -0700 (PDT) From: Punit Agrawal Subject: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour Date: Fri, 18 Aug 2017 15:54:15 +0100 Message-Id: <20170818145415.7588-1-punit.agrawal@arm.com> In-Reply-To: <20170725154114.24131-2-punit.agrawal@arm.com> References: <20170725154114.24131-2-punit.agrawal@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Punit Agrawal , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Catalin Marinas , Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . Shutemov" , Michal Hocko , Mike Kravetz When walking the page tables to resolve an address that points to a !p*d_present() entry, huge_pte_offset() returns inconsistent values depending on the level of page table (PUD or PMD). It returns NULL in the case of a PUD entry while in the case of a PMD entry, it returns a pointer to the page table entry. A similar inconsistency exists when handling swap entries - it returns NULL for a PUD entry while a pointer to the pte_t is returned for the PMD entry. Update huge_pte_offset() to make the behaviour consistent - return a pointer to the pte_t for hugepage or swap entries. Only return NULL in instances where we have a p*d_none() entry and the size parameter doesn't match the hugepage size at this level of the page table. Document the behaviour to clarify the expected behaviour of this function. 
This is to set clear semantics for architecture specific implementations of huge_pte_offset(). Signed-off-by: Punit Agrawal Cc: Catalin Marinas Cc: Naoya Horiguchi Cc: Steve Capper Cc: Will Deacon Cc: Kirill A. Shutemov Cc: Michal Hocko Cc: Mike Kravetz --- Hi Andrew, From discussions on the arm64 implementation of huge_pte_offset()[0] we realised that there is a benefit to returning a pte_t* in the case of p*d_none(). The fault handling code in hugetlb_fault() can handle p*d_none() entries, which saves an extra round trip to huge_pte_alloc(). Other callers of huge_pte_offset() should be ok as well. Apologies for sending a late update but I thought if we are defining the semantics, it's worth getting them right. Could you please pick this version? Thanks, Punit [0] http://www.spinics.net/lists/linux-mm/msg133699.html v2: mm/hugetlb.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 31e207cb399b..1d54a131bdd5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, return pte; } +/* + * huge_pte_offset() - Walk the page table to resolve the hugepage + * entry at address @addr + * + * Return: Pointer to page table or swap entry (PUD or PMD) for + * address @addr, or NULL if a p*d_none() entry is encountered and the + * size @sz doesn't match the hugepage size at this level of the page + * table. + */ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz) { @@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm, p4d = p4d_offset(pgd, addr); if (!p4d_present(*p4d)) return NULL; + pud = pud_offset(p4d, addr); - if (!pud_present(*pud)) + if (sz != PUD_SIZE && pud_none(*pud)) return NULL; - if (pud_huge(*pud)) + /* hugepage or swap? 
*/ + if (pud_huge(*pud) || !pud_present(*pud)) return (pte_t *)pud; + pmd = pmd_offset(pud, addr); - return (pte_t *) pmd; + if (sz != PMD_SIZE && pmd_none(*pmd)) + return NULL; + /* hugepage or swap? */ + if (pmd_huge(*pmd) || !pmd_present(*pmd)) + return (pte_t *)pmd; + + return NULL; } #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ -- 2.13.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f197.google.com (mail-qk0-f197.google.com [209.85.220.197]) by kanga.kvack.org (Postfix) with ESMTP id 168EB6B04AD for ; Fri, 18 Aug 2017 17:29:32 -0400 (EDT) Received: by mail-qk0-f197.google.com with SMTP id s18so52170303qks.4 for ; Fri, 18 Aug 2017 14:29:32 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id t39si6138222qtb.353.2017.08.18.14.29.30 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 18 Aug 2017 14:29:31 -0700 (PDT) Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-2-punit.agrawal@arm.com> <20170818145415.7588-1-punit.agrawal@arm.com> From: Mike Kravetz Message-ID: <3de49294-f6f8-2623-1778-56a3b092f2a5@oracle.com> Date: Fri, 18 Aug 2017 14:29:18 -0700 MIME-Version: 1.0 In-Reply-To: <20170818145415.7588-1-punit.agrawal@arm.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal , Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Catalin Marinas , Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . 
Shutemov" , Michal Hocko On 08/18/2017 07:54 AM, Punit Agrawal wrote: > When walking the page tables to resolve an address that points to > !p*d_present() entry, huge_pte_offset() returns inconsistent values > depending on the level of page table (PUD or PMD). > > It returns NULL in the case of a PUD entry while in the case of a PMD > entry, it returns a pointer to the page table entry. > > A similar inconsitency exists when handling swap entries - returns NULL > for a PUD entry while a pointer to the pte_t is retured for the PMD entry. > > Update huge_pte_offset() to make the behaviour consistent - return a > pointer to the pte_t for hugepage or swap entries. Only return NULL in > instances where we have a p*d_none() entry and the size parameter > doesn't match the hugepage size at this level of the page table. > > Document the behaviour to clarify the expected behaviour of this function. > This is to set clear semantics for architecture specific implementations > of huge_pte_offset(). > > Signed-off-by: Punit Agrawal > Cc: Catalin Marinas > Cc: Naoya Horiguchi > Cc: Steve Capper > Cc: Will Deacon > Cc: Kirill A. Shutemov > Cc: Michal Hocko > Cc: Mike Kravetz > --- > > Hi Andrew, > > From discussions on the arm64 implementation of huge_pte_offset()[0] > we realised that there is benefit from returning a pte_t* in the case > of p*d_none(). > > The fault handling code in hugetlb_fault() can handle p*d_none() > entries and saves an extra round trip to huge_pte_alloc(). Other > callers of huge_pte_offset() should be ok as well. Yes, this change would eliminate that call to huge_pte_alloc() in hugetlb_fault(). However, huge_pte_offset() is now returning a pointer to a p*d_none() pte in some instances where it would have previously returned NULL. Correct? I went through the callers, and like you am fairly confident that they can handle this situation. 
But, returning a pointer to a p*d_none() entry instead of NULL does change the execution path in several routines such as copy_hugetlb_page_range, __unmap_hugepage_range, hugetlb_change_protection, and follow_hugetlb_page. If huge_pte_offset() returns NULL to these routines, they do a quick continue, exit, etc. If they are returned a pointer, they typically lock the page table(s) and then check for p*d_none() before continuing, exiting, etc. So, it appears that these routines could potentially slow down a bit with this change (in the specific case of p*d_none). I 'think' one could argue that the fault case is more important. So, the savings there would outweigh any potential slowdown in the other routines. IMO, this new version of the patch has more potential for issues than the previous version. It would be helpful if others could take a look. One thing I am still 'thinking' about is how this patch could potentially change behavior in huge_pmd_share. With the patch, pmd sharing could potentially be set up in situations (pmd_none) where it previously would not have been set up. I don't think this is an issue, but any change to this concerns me. -- Mike Kravetz > > Apologies for sending a late update but I thought if we are defining > the semantics, it's worth getting them right. > > Could you please pick this version please? 
> > Thanks, > Punit > > [0] http://www.spinics.net/lists/linux-mm/msg133699.html > > v2: > > mm/hugetlb.c | 24 +++++++++++++++++++++--- > 1 file changed, 21 insertions(+), 3 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 31e207cb399b..1d54a131bdd5 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, > return pte; > } > > +/* > + * huge_pte_offset() - Walk the page table to resolve the hugepage > + * entry at address @addr > + * > + * Return: Pointer to page table or swap entry (PUD or PMD) for > + * address @addr, or NULL if a p*d_none() entry is encountered and the > + * size @sz doesn't match the hugepage size at this level of the page > + * table. > + */ > pte_t *huge_pte_offset(struct mm_struct *mm, > unsigned long addr, unsigned long sz) > { > @@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm, > p4d = p4d_offset(pgd, addr); > if (!p4d_present(*p4d)) > return NULL; > + > pud = pud_offset(p4d, addr); > - if (!pud_present(*pud)) > + if (sz != PUD_SIZE && pud_none(*pud)) > return NULL; > - if (pud_huge(*pud)) > + /* hugepage or swap? */ > + if (pud_huge(*pud) || !pud_present(*pud)) > return (pte_t *)pud; > + > pmd = pmd_offset(pud, addr); > - return (pte_t *) pmd; > + if (sz != PMD_SIZE && pmd_none(*pmd)) > + return NULL; > + /* hugepage or swap? */ > + if (pmd_huge(*pmd) || !pmd_present(*pmd)) > + return (pte_t *)pmd; > + > + return NULL; > } > > #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 91612280310 for ; Mon, 21 Aug 2017 14:07:49 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id r133so290959979pgr.6 for ; Mon, 21 Aug 2017 11:07:49 -0700 (PDT) Received: from foss.arm.com (foss.arm.com. [217.140.101.70]) by mx.google.com with ESMTP id h16si8312346pli.448.2017.08.21.11.07.47 for ; Mon, 21 Aug 2017 11:07:48 -0700 (PDT) Date: Mon, 21 Aug 2017 19:07:42 +0100 From: Catalin Marinas Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour Message-ID: <20170821180741.4ns2s4wp3t2r6mpi@armageddon.cambridge.arm.com> References: <20170725154114.24131-2-punit.agrawal@arm.com> <20170818145415.7588-1-punit.agrawal@arm.com> <3de49294-f6f8-2623-1778-56a3b092f2a5@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3de49294-f6f8-2623-1778-56a3b092f2a5@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Punit Agrawal , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . Shutemov" , Michal Hocko On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote: > On 08/18/2017 07:54 AM, Punit Agrawal wrote: > > When walking the page tables to resolve an address that points to > > !p*d_present() entry, huge_pte_offset() returns inconsistent values > > depending on the level of page table (PUD or PMD). > > > > It returns NULL in the case of a PUD entry while in the case of a PMD > > entry, it returns a pointer to the page table entry. > > > > A similar inconsitency exists when handling swap entries - returns NULL > > for a PUD entry while a pointer to the pte_t is retured for the PMD entry. 
> > > > Update huge_pte_offset() to make the behaviour consistent - return a > > pointer to the pte_t for hugepage or swap entries. Only return NULL in > > instances where we have a p*d_none() entry and the size parameter > > doesn't match the hugepage size at this level of the page table. > > > > Document the behaviour to clarify the expected behaviour of this function. > > This is to set clear semantics for architecture specific implementations > > of huge_pte_offset(). > > > > Signed-off-by: Punit Agrawal > > Cc: Catalin Marinas > > Cc: Naoya Horiguchi > > Cc: Steve Capper > > Cc: Will Deacon > > Cc: Kirill A. Shutemov > > Cc: Michal Hocko > > Cc: Mike Kravetz > > --- > > > > Hi Andrew, > > > > From discussions on the arm64 implementation of huge_pte_offset()[0] > > we realised that there is benefit from returning a pte_t* in the case > > of p*d_none(). > > > > The fault handling code in hugetlb_fault() can handle p*d_none() > > entries and saves an extra round trip to huge_pte_alloc(). Other > > callers of huge_pte_offset() should be ok as well. > > Yes, this change would eliminate that call to huge_pte_alloc() in > hugetlb_fault(). However, huge_pte_offset() is now returning a pointer > to a p*d_none() pte in some instances where it would have previously > returned NULL. Correct? Yes (whether it was previously the right thing to return is a different matter; that's what we are trying to clarify in the generic code so that we can have similar semantics on arm64). > I went through the callers, and like you am fairly confident that they > can handle this situation. But, returning p*d_none() instead of NULL > does change the execution path in several routines such as > copy_hugetlb_page_range, __unmap_hugepage_range hugetlb_change_protection, > and follow_hugetlb_page. If huge_pte_alloc() returns NULL to these > routines, they do a quick continue, exit, etc. 
If they are returned > a pointer, they typically lock the page table(s) and then check for > p*d_none() before continuing, exiting, etc. So, it appears that these > routines could potentially slow down a bit with this change (in the specific > case of p*d_none). Arguably (well, my interpretation), it should return a NULL only if the entry is a table entry, potentially pointing to a next level (pmd). In the pud case, this means that sz < PUD_SIZE. If the pud is a last level huge page entry (either present or !present), huge_pte_offset() should return the pointer to it and never NULL. If the entry is a swap or migration one (pte_present() == false), with the current code we don't even enter the corresponding checks in copy_hugetlb_page_range(). I also assume that the ptl in __unmap_hugepage_range() is taken to avoid some race when the entry is a huge page (present or not). If such a race doesn't exist, we could just as well check huge_pte_none() outside the locked region (which is what the current huge_pte_offset() does with !pud_present()). IMHO, while the current generic huge_pte_offset() avoids some code paths in the functions you mentioned, the results are not always correct (missing swap/migration entries or potentially racy). -- Catalin -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f197.google.com (mail-yw0-f197.google.com [209.85.161.197]) by kanga.kvack.org (Postfix) with ESMTP id 3E417280422 for ; Mon, 21 Aug 2017 17:30:47 -0400 (EDT) Received: by mail-yw0-f197.google.com with SMTP id t188so110563855ywb.10 for ; Mon, 21 Aug 2017 14:30:47 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. 
[156.151.31.81]) by mx.google.com with ESMTPS id h4si1166497ybj.372.2017.08.21.14.30.45 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 21 Aug 2017 14:30:46 -0700 (PDT) Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-2-punit.agrawal@arm.com> <20170818145415.7588-1-punit.agrawal@arm.com> <3de49294-f6f8-2623-1778-56a3b092f2a5@oracle.com> <20170821180741.4ns2s4wp3t2r6mpi@armageddon.cambridge.arm.com> From: Mike Kravetz Message-ID: Date: Mon, 21 Aug 2017 14:30:33 -0700 MIME-Version: 1.0 In-Reply-To: <20170821180741.4ns2s4wp3t2r6mpi@armageddon.cambridge.arm.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Catalin Marinas Cc: Punit Agrawal , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . Shutemov" , Michal Hocko On 08/21/2017 11:07 AM, Catalin Marinas wrote: > On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote: >> On 08/18/2017 07:54 AM, Punit Agrawal wrote: >>> When walking the page tables to resolve an address that points to >>> !p*d_present() entry, huge_pte_offset() returns inconsistent values >>> depending on the level of page table (PUD or PMD). >>> >>> It returns NULL in the case of a PUD entry while in the case of a PMD >>> entry, it returns a pointer to the page table entry. >>> >>> A similar inconsitency exists when handling swap entries - returns NULL >>> for a PUD entry while a pointer to the pte_t is retured for the PMD entry. >>> >>> Update huge_pte_offset() to make the behaviour consistent - return a >>> pointer to the pte_t for hugepage or swap entries. Only return NULL in >>> instances where we have a p*d_none() entry and the size parameter >>> doesn't match the hugepage size at this level of the page table. 
>>> >>> Document the behaviour to clarify the expected behaviour of this function. >>> This is to set clear semantics for architecture specific implementations >>> of huge_pte_offset(). >>> >>> Signed-off-by: Punit Agrawal >>> Cc: Catalin Marinas >>> Cc: Naoya Horiguchi >>> Cc: Steve Capper >>> Cc: Will Deacon >>> Cc: Kirill A. Shutemov >>> Cc: Michal Hocko >>> Cc: Mike Kravetz >>> --- >>> >>> Hi Andrew, >>> >>> From discussions on the arm64 implementation of huge_pte_offset()[0] >>> we realised that there is benefit from returning a pte_t* in the case >>> of p*d_none(). >>> >>> The fault handling code in hugetlb_fault() can handle p*d_none() >>> entries and saves an extra round trip to huge_pte_alloc(). Other >>> callers of huge_pte_offset() should be ok as well. >> >> Yes, this change would eliminate that call to huge_pte_alloc() in >> hugetlb_fault(). However, huge_pte_offset() is now returning a pointer >> to a p*d_none() pte in some instances where it would have previously >> returned NULL. Correct? > > Yes (whether it was previously the right thing to return is a different > matter; that's what we are trying to clarify in the generic code so that > we can have similar semantics on arm64). > >> I went through the callers, and like you am fairly confident that they >> can handle this situation. But, returning p*d_none() instead of NULL >> does change the execution path in several routines such as >> copy_hugetlb_page_range, __unmap_hugepage_range hugetlb_change_protection, >> and follow_hugetlb_page. If huge_pte_alloc() returns NULL to these >> routines, they do a quick continue, exit, etc. If they are returned >> a pointer, they typically lock the page table(s) and then check for >> p*d_none() before continuing, exiting, etc. So, it appears that these >> routines could potentially slow down a bit with this change (in the specific >> case of p*d_none). 
> > Arguably (well, my interpretation), it should return a NULL only if the > entry is a table entry, potentially pointing to a next level (pmd). In > the pud case, this means that sz < PUD_SIZE. > > If the pud is a last level huge page entry (either present or !present), > huge_pte_offset() should return the pointer to it and never NULL. If the > entry is a swap or migration one (pte_present() == false) with the > current code we don't even enter the corresponding checks in > copy_hugetlb_page_range(). > > I also assume that the ptl __unmap_hugepage_range() is taken to avoid > some race when the entry is a huge page (present or not). If such race > doesn't exist, we could as well check the huge_pte_none() outside the > locked region (which is what the current huge_pte_offset() does with > !pud_present()). > > IMHO, while the current generic huge_pte_offset() avoids some code paths > in the functions you mentioned, the results are not always correct > (missing swap/migration entries or potentially racy). Thanks Catalin, The more I look at this code and think about it, the more I like it. As Michal previously mentioned, changes in this area can break things in subtle ways. That is why I was cautious and asked for more people to look at it. My primary concerns with these changes in this area were: - Any potential changes in behavior. I think this has been sufficiently explored. While there may be small differences in behavior (for the better), this change should not introduce any bugs/breakage. - Other arch specific implementations are not aligned with the new behavior. Again, this should not cause any issues. Punit (and I) have looked at the arch specific implementations for issues and found none. In addition, since we are not changing any of the 'calling code', no issues should be introduced for arch specific implementations. I like the new semantics and did not find any issues. 
Reviewed-by: Mike Kravetz -- Mike Kravetz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 99C00280310 for ; Tue, 22 Aug 2017 06:11:25 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id o82so68066056pfj.11 for ; Tue, 22 Aug 2017 03:11:25 -0700 (PDT) Received: from foss.arm.com (foss.arm.com. [217.140.101.70]) by mx.google.com with ESMTP id h5si9647733pln.768.2017.08.22.03.11.24 for ; Tue, 22 Aug 2017 03:11:24 -0700 (PDT) Date: Tue, 22 Aug 2017 11:11:18 +0100 From: Catalin Marinas Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour Message-ID: <20170822101117.ilnys32tugytbbjc@armageddon.cambridge.arm.com> References: <20170725154114.24131-2-punit.agrawal@arm.com> <20170818145415.7588-1-punit.agrawal@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170818145415.7588-1-punit.agrawal@arm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Punit Agrawal Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . Shutemov" , Michal Hocko , Mike Kravetz On Fri, Aug 18, 2017 at 03:54:15PM +0100, Punit Agrawal wrote: > When walking the page tables to resolve an address that points to > !p*d_present() entry, huge_pte_offset() returns inconsistent values > depending on the level of page table (PUD or PMD). > > It returns NULL in the case of a PUD entry while in the case of a PMD > entry, it returns a pointer to the page table entry. 
> > A similar inconsitency exists when handling swap entries - returns NULL > for a PUD entry while a pointer to the pte_t is retured for the PMD entry. > > Update huge_pte_offset() to make the behaviour consistent - return a > pointer to the pte_t for hugepage or swap entries. Only return NULL in > instances where we have a p*d_none() entry and the size parameter > doesn't match the hugepage size at this level of the page table. > > Document the behaviour to clarify the expected behaviour of this function. > This is to set clear semantics for architecture specific implementations > of huge_pte_offset(). > > Signed-off-by: Punit Agrawal > Cc: Catalin Marinas > Cc: Naoya Horiguchi > Cc: Steve Capper > Cc: Will Deacon > Cc: Kirill A. Shutemov > Cc: Michal Hocko > Cc: Mike Kravetz FWIW: Reviewed-by: Catalin Marinas Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id A30702806E4 for ; Tue, 22 Aug 2017 11:32:44 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id q3so112464993pgr.3 for ; Tue, 22 Aug 2017 08:32:44 -0700 (PDT) Received: from foss.arm.com (usa-sjc-mx-foss1.foss.arm.com. 
[217.140.101.70]) by mx.google.com with ESMTP id z9si8903702pgo.45.2017.08.22.08.32.42 for ; Tue, 22 Aug 2017 08:32:42 -0700 (PDT) From: Punit Agrawal Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour References: <20170725154114.24131-2-punit.agrawal@arm.com> <20170818145415.7588-1-punit.agrawal@arm.com> <3de49294-f6f8-2623-1778-56a3b092f2a5@oracle.com> <20170821180741.4ns2s4wp3t2r6mpi@armageddon.cambridge.arm.com> Date: Tue, 22 Aug 2017 16:32:39 +0100 In-Reply-To: (Mike Kravetz's message of "Mon, 21 Aug 2017 14:30:33 -0700") Message-ID: <87shgjmxd4.fsf@e105922-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: Mike Kravetz Cc: Catalin Marinas , Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . Shutemov" , Michal Hocko Hi Mike, Mike Kravetz writes: > On 08/21/2017 11:07 AM, Catalin Marinas wrote: >> On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote: >>> On 08/18/2017 07:54 AM, Punit Agrawal wrote: >>>> When walking the page tables to resolve an address that points to >>>> !p*d_present() entry, huge_pte_offset() returns inconsistent values >>>> depending on the level of page table (PUD or PMD). >>>> >>>> It returns NULL in the case of a PUD entry while in the case of a PMD >>>> entry, it returns a pointer to the page table entry. >>>> >>>> A similar inconsitency exists when handling swap entries - returns NULL >>>> for a PUD entry while a pointer to the pte_t is retured for the PMD entry. >>>> >>>> Update huge_pte_offset() to make the behaviour consistent - return a >>>> pointer to the pte_t for hugepage or swap entries. Only return NULL in >>>> instances where we have a p*d_none() entry and the size parameter >>>> doesn't match the hugepage size at this level of the page table. 
>>>> >>>> Document the behaviour to clarify the expected behaviour of this function. >>>> This is to set clear semantics for architecture specific implementations >>>> of huge_pte_offset(). >>>> >>>> Signed-off-by: Punit Agrawal >>>> Cc: Catalin Marinas >>>> Cc: Naoya Horiguchi >>>> Cc: Steve Capper >>>> Cc: Will Deacon >>>> Cc: Kirill A. Shutemov >>>> Cc: Michal Hocko >>>> Cc: Mike Kravetz >>>> --- >>>> >>>> Hi Andrew, >>>> >>>> From discussions on the arm64 implementation of huge_pte_offset()[0] >>>> we realised that there is benefit from returning a pte_t* in the case >>>> of p*d_none(). >>>> >>>> The fault handling code in hugetlb_fault() can handle p*d_none() >>>> entries and saves an extra round trip to huge_pte_alloc(). Other >>>> callers of huge_pte_offset() should be ok as well. >>> >>> Yes, this change would eliminate that call to huge_pte_alloc() in >>> hugetlb_fault(). However, huge_pte_offset() is now returning a pointer >>> to a p*d_none() pte in some instances where it would have previously >>> returned NULL. Correct? >> >> Yes (whether it was previously the right thing to return is a different >> matter; that's what we are trying to clarify in the generic code so that >> we can have similar semantics on arm64). >> >>> I went through the callers, and like you am fairly confident that they >>> can handle this situation. But, returning p*d_none() instead of NULL >>> does change the execution path in several routines such as >>> copy_hugetlb_page_range, __unmap_hugepage_range hugetlb_change_protection, >>> and follow_hugetlb_page. If huge_pte_alloc() returns NULL to these >>> routines, they do a quick continue, exit, etc. If they are returned >>> a pointer, they typically lock the page table(s) and then check for >>> p*d_none() before continuing, exiting, etc. So, it appears that these >>> routines could potentially slow down a bit with this change (in the specific >>> case of p*d_none). 
>> >> Arguably (well, my interpretation), it should return a NULL only if the >> entry is a table entry, potentially pointing to a next level (pmd). In >> the pud case, this means that sz < PUD_SIZE. >> >> If the pud is a last level huge page entry (either present or !present), >> huge_pte_offset() should return the pointer to it and never NULL. If the >> entry is a swap or migration one (pte_present() == false) with the >> current code we don't even enter the corresponding checks in >> copy_hugetlb_page_range(). >> >> I also assume that the ptl __unmap_hugepage_range() is taken to avoid >> some race when the entry is a huge page (present or not). If such race >> doesn't exist, we could as well check the huge_pte_none() outside the >> locked region (which is what the current huge_pte_offset() does with >> !pud_present()). >> >> IMHO, while the current generic huge_pte_offset() avoids some code paths >> in the functions you mentioned, the results are not always correct >> (missing swap/migration entries or potentially racy). > > Thanks Catalin, > > The more I look at this code and think about it, the more I like it. As > Michal previously mentioned, changes in this area can break things in subtle > ways. That is why I was cautious and asked for more people to look at it. > My primary concerns with these changes in this area were: > - Any potential changes in behavior. I think this has been sufficiently > explored. While there may be small differences in behavior (for the > better), this change should not introduce any bugs/breakage. > - Other arch specific implementations are not aligned with the new > behavior. Again, this should not cause any issues. Punit (and I) have > looked at the arch specific implementations for issues and found none. > In addition, since we are not changing any of the 'calling code', no > issues should be introduced for arch specific implementations. > > I like the new semantics and did not find any issues. 
>
> Reviewed-by: Mike Kravetz

Thanks for reviewing the updated semantics against existing usage.
I'll monitor the lists for any reported breakage but please do shout out
if you notice any issues.

Thanks,
Punit

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200])
	by kanga.kvack.org (Postfix) with ESMTP id 6C7DF6B0292
	for ; Wed, 30 Aug 2017 03:49:47 -0400 (EDT)
Received: by mail-wr0-f200.google.com with SMTP id k9so3951901wre.11
	for ; Wed, 30 Aug 2017 00:49:47 -0700 (PDT)
Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15])
	by mx.google.com with ESMTPS id 73si1134035wmw.171.2017.08.30.00.49.45
	for (version=TLS1 cipher=AES128-SHA bits=128/128);
	Wed, 30 Aug 2017 00:49:45 -0700 (PDT)
Date: Wed, 30 Aug 2017 09:49:43 +0200
From: Michal Hocko
Subject: Re: [PATCH v2] mm/hugetlb.c: make huge_pte_offset() consistent and document behaviour
Message-ID: <20170830074943.f4jm42l2fdaordn2@dhcp22.suse.cz>
References: <20170725154114.24131-2-punit.agrawal@arm.com> <20170818145415.7588-1-punit.agrawal@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170818145415.7588-1-punit.agrawal@arm.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Punit Agrawal
Cc: Andrew Morton , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, Catalin Marinas , Naoya Horiguchi , Steve Capper , Will Deacon , "Kirill A . Shutemov" , Mike Kravetz

On Fri 18-08-17 15:54:15, Punit Agrawal wrote:
> When walking the page tables to resolve an address that points to
> !p*d_present() entry, huge_pte_offset() returns inconsistent values
> depending on the level of page table (PUD or PMD).
>
> It returns NULL in the case of a PUD entry while in the case of a PMD
> entry, it returns a pointer to the page table entry.
>
> A similar inconsistency exists when handling swap entries - returns NULL
> for a PUD entry while a pointer to the pte_t is returned for the PMD entry.
>
> Update huge_pte_offset() to make the behaviour consistent - return a
> pointer to the pte_t for hugepage or swap entries. Only return NULL in
> instances where we have a p*d_none() entry and the size parameter
> doesn't match the hugepage size at this level of the page table.
>
> Document the behaviour to clarify the expected behaviour of this function.
> This is to set clear semantics for architecture specific implementations
> of huge_pte_offset().
>
> Signed-off-by: Punit Agrawal
> Cc: Catalin Marinas
> Cc: Naoya Horiguchi
> Cc: Steve Capper
> Cc: Will Deacon
> Cc: Kirill A. Shutemov
> Cc: Michal Hocko
> Cc: Mike Kravetz

I always thought that the weird semantic is a result of the hugetlb pte
sharing. But now that I dug into history it has been added by
02b0ccef903e ("[PATCH] hugetlb: check p?d_present in huge_pte_offset()")
for a completely different reason. I suspect the weird semantic just
wasn't noticed back then.

Anyway, I didn't find any problem with the patch.
Acked-by: Michal Hocko

> ---
>
> Hi Andrew,
>
> From discussions on the arm64 implementation of huge_pte_offset()[0]
> we realised that there is benefit from returning a pte_t* in the case
> of p*d_none().
>
> The fault handling code in hugetlb_fault() can handle p*d_none()
> entries and saves an extra round trip to huge_pte_alloc(). Other
> callers of huge_pte_offset() should be ok as well.
>
> Apologies for sending a late update but I thought if we are defining
> the semantics, it's worth getting them right.
>
> Could you please pick this version?
>
> Thanks,
> Punit
>
> [0] http://www.spinics.net/lists/linux-mm/msg133699.html
>
> v2:
>
>  mm/hugetlb.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31e207cb399b..1d54a131bdd5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4600,6 +4600,15 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  	return pte;
>  }
>
> +/*
> + * huge_pte_offset() - Walk the page table to resolve the hugepage
> + * entry at address @addr
> + *
> + * Return: Pointer to page table or swap entry (PUD or PMD) for
> + * address @addr, or NULL if a p*d_none() entry is encountered and the
> + * size @sz doesn't match the hugepage size at this level of the page
> + * table.
> + */
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz)
>  {
> @@ -4614,13 +4623,22 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  	p4d = p4d_offset(pgd, addr);
>  	if (!p4d_present(*p4d))
>  		return NULL;
> +
>  	pud = pud_offset(p4d, addr);
> -	if (!pud_present(*pud))
> +	if (sz != PUD_SIZE && pud_none(*pud))
>  		return NULL;
> -	if (pud_huge(*pud))
> +	/* hugepage or swap? */
> +	if (pud_huge(*pud) || !pud_present(*pud))
>  		return (pte_t *)pud;
> +
>  	pmd = pmd_offset(pud, addr);
> -	return (pte_t *) pmd;
> +	if (sz != PMD_SIZE && pmd_none(*pmd))
> +		return NULL;
> +	/* hugepage or swap? */
> +	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> +		return (pte_t *)pmd;
> +
> +	return NULL;
>  }
>
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> --
> 2.13.2
>

-- 
Michal Hocko
SUSE Labs
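
The return-value rules in the v2 patch quoted above can be tried out in
user space. What follows is only a rough sketch under stated assumptions,
not kernel code: every name in it (entry_state, model_walk,
model_huge_pte_offset, MODEL_PUD_SIZE, MODEL_PMD_SIZE, model_selfcheck) is
hypothetical, the sizes are illustrative, and the pgd/p4d levels are left
out. It mirrors only the PUD/PMD control flow of the patched generic
function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum entry_state {
	ENTRY_NONE,	/* models p*d_none(): nothing installed */
	ENTRY_TABLE,	/* present, points to the next page table level */
	ENTRY_HUGE,	/* models p*d_huge(): present hugepage mapping */
	ENTRY_SWAP	/* not none and not present: swap/migration entry */
};

/* Assumed sizes, chosen for illustration only. */
#define MODEL_PUD_SIZE (1UL << 30)
#define MODEL_PMD_SIZE (1UL << 21)

struct model_walk {
	enum entry_state pud;
	enum entry_state pmd;	/* only consulted when pud is ENTRY_TABLE */
};

/* Models p*d_present(): table and hugepage entries are present. */
static int model_present(enum entry_state s)
{
	return s == ENTRY_TABLE || s == ENTRY_HUGE;
}

/*
 * Follows the PUD/PMD checks of the patched function: NULL for a none
 * entry when @sz does not match this level, a pointer for hugepage,
 * swap, or size-matching none entries, and NULL if both levels turn
 * out to be ordinary table entries.
 */
static const enum entry_state *
model_huge_pte_offset(const struct model_walk *w, unsigned long sz)
{
	if (sz != MODEL_PUD_SIZE && w->pud == ENTRY_NONE)
		return NULL;
	/* hugepage or swap? */
	if (w->pud == ENTRY_HUGE || !model_present(w->pud))
		return &w->pud;

	if (sz != MODEL_PMD_SIZE && w->pmd == ENTRY_NONE)
		return NULL;
	/* hugepage or swap? */
	if (w->pmd == ENTRY_HUGE || !model_present(w->pmd))
		return &w->pmd;

	return NULL;
}

/* Exercises the cases discussed in the thread. */
void model_selfcheck(void)
{
	struct model_walk w = { ENTRY_NONE, ENTRY_NONE };

	/* none PUD, PMD-sized lookup: size mismatch, so NULL */
	assert(model_huge_pte_offset(&w, MODEL_PMD_SIZE) == NULL);
	/* none PUD, PUD-sized lookup: pointer, saving the alloc round trip */
	assert(model_huge_pte_offset(&w, MODEL_PUD_SIZE) == &w.pud);

	/* swap entry at PUD level is returned, never NULL */
	w.pud = ENTRY_SWAP;
	assert(model_huge_pte_offset(&w, MODEL_PMD_SIZE) == &w.pud);

	/* table PUD, none PMD, PMD-sized lookup: pointer to the PMD slot */
	w.pud = ENTRY_TABLE;
	w.pmd = ENTRY_NONE;
	assert(model_huge_pte_offset(&w, MODEL_PMD_SIZE) == &w.pmd);
}
```

Calling model_selfcheck() from a small main() runs the four cases. The
second and fourth are the p*d_none() lookups where the pre-patch code
could hand back NULL and the patched code returns a pointer, which is the
behaviour change Mike and Catalin discuss for callers such as
hugetlb_fault().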