linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Yang Shi <shy828301@gmail.com>,
	Wang Yugui <wangyugui@e16-tech.com>,
	Matthew Wilcox <willy@infradead.org>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>, Zi Yan <ziy@nvidia.com>,
	Peter Xu <peterx@redhat.com>, Will Deacon <will@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic()
Date: Fri, 11 Jun 2021 12:36:13 -0300	[thread overview]
Message-ID: <20210611153613.GR1096940@ziepe.ca> (raw)
In-Reply-To: <aebb6b96-153e-7d7-59da-f6bad4337aa7@google.com>

On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > 
> > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > Cc: <stable@vger.kernel.org>
> > > >  mm/page_vma_mapped.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  	pud = pud_offset(p4d, pvmw->address);
> > > >  	if (!pud_present(*pud))
> > > >  		return false;
> > > > +
> > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > >  	/*
> > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > >  	 * compiler and used as a stale value after we've observed a
> > > >  	 * subsequent update.
> > > >  	 */
> > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > +	barrier();
> > > > +
> > > 
> > > Hm. It makes me wounder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > rest of pmd_read_atomic() users may be broken.
> > > 
> > > Am I wrong?
> > 
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> > 
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
> 
> I've had great difficulty coming up with answers for you.
> 
> This patch was based on two notions I've had lodged in my mind
> for several years:
> 
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't).  And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().

IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock, AFAIK
our modern understanding of the memory model is that requires
READ_ONCE().

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
-	u32 *tmp = (u32 *)pmdp;
+	u32 *tmp = READ_ONCE((u32 *)pmdp);
 
 	ret = (pmdval_t) (*tmp);
 	if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 		 * or we can end up with a partial pmd.
 		 */
 		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+		ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
 	}
 
 	return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 	 * only going to work, if the pmdval_t isn't larger than
 	 * an unsigned long.
 	 */
-	return *pmdp;
+	return READ_ONCE(*pmdp);
 }
 #endif
 

> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee? or
> did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.

This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
dropped here "because it doesn't work on non-scalar types" due to a
(now 8 year old) gcc bug.

According to the gcc bug READ_ONCE() on anything that is a scalar
sized struct triggers GCC to ignore the READ_ONCEyness. To work around
this bug then READ_ONCE can never be used on any of the struct
protected page table elements. While I am not 100% sure, it looks like
this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
required compiler this is all just cruft. Use READ_ONCE() here too...

> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :)  And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.

Indeed, that is good news

> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard!  (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)

If I recall, (and I didn't recheck this right now) the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.

Otherwise it returns the flags and a garbage pointer because under the
THP protocol a PMD pointing at a page can be converted to a PTE table
if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
page requires mmap sem in write mode so once a PTE table is observed
'locklessly' the value is stable.. Or at least so says the documentation

pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..

As far as is page_vma_mapped_walk correct.. Everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know
the rules aroudn migration entries to know right/wrong.

I suspect it probably is required to manipulate a migration entry
while holding the mmap_sem in write mode??

There is some list of changes to the page table that require holding
the mmap sem in write mode that I've never seen documented for..

Jason

  reply	other threads:[~2021-06-11 15:36 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-10  6:31 [PATCH 00/11] mm: page_vma_mapped_walk() cleanup and THP fixes Hugh Dickins
2021-06-10  6:34 ` [PATCH 01/11] mm: page_vma_mapped_walk(): use page for pvmw->page Hugh Dickins
2021-06-10  8:12   ` Alistair Popple
2021-06-10  8:55   ` Kirill A. Shutemov
2021-06-10 14:14     ` Peter Xu
2021-06-10 22:35       ` Hugh Dickins
2021-06-10  6:36 ` [PATCH 02/11] mm: page_vma_mapped_walk(): settle PageHuge on entry Hugh Dickins
2021-06-10  8:57   ` Kirill A. Shutemov
2021-06-10 14:17   ` Peter Xu
2021-06-10 22:45     ` Hugh Dickins
2021-06-10  6:38 ` [PATCH 03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic() Hugh Dickins
2021-06-10  9:06   ` Kirill A. Shutemov
2021-06-10 12:15     ` Jason Gunthorpe
2021-06-11  6:37       ` Hugh Dickins
2021-06-11 15:36         ` Jason Gunthorpe [this message]
2021-06-11 19:05           ` Hugh Dickins
2021-06-11 19:42             ` Jason Gunthorpe
2021-06-15  9:46               ` Will Deacon
2021-06-16  0:42                 ` Jason Gunthorpe
2021-06-16 10:27                   ` Will Deacon
2021-06-11 19:33           ` Hugh Dickins
2021-06-10  6:40 ` [PATCH 04/11] mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd Hugh Dickins
2021-06-10  9:10   ` Kirill A. Shutemov
2021-06-10 14:31   ` Peter Xu
2021-06-10  6:42 ` [PATCH 05/11] mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block Hugh Dickins
2021-06-10  9:16   ` Kirill A. Shutemov
2021-06-10 14:48   ` Peter Xu
2021-06-10  6:44 ` [PATCH 06/11] mm: page_vma_mapped_walk(): crossing page table boundary Hugh Dickins
2021-06-10  9:32   ` Kirill A. Shutemov
2021-06-10 23:02     ` Hugh Dickins
2021-06-11 11:23       ` Kirill A. Shutemov
2021-06-10  6:46 ` [PATCH 07/11] mm: page_vma_mapped_walk(): add a level of indentation Hugh Dickins
2021-06-10  9:34   ` Kirill A. Shutemov
2021-06-10  6:48 ` [PATCH 08/11] mm: page_vma_mapped_walk(): use goto instead of while (1) Hugh Dickins
2021-06-10  9:39   ` Kirill A. Shutemov
2021-06-10  6:50 ` [PATCH 09/11] mm: page_vma_mapped_walk(): get vma_address_end() earlier Hugh Dickins
2021-06-10  9:40   ` Kirill A. Shutemov
2021-06-10  6:52 ` [PATCH 10/11] mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes Hugh Dickins
2021-06-10  9:42   ` Kirill A. Shutemov
2021-06-10  6:54 ` [PATCH 11/11] mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() Hugh Dickins
2021-06-10  9:43   ` Kirill A. Shutemov
2021-06-11 18:29     ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210611153613.GR1096940@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    --cc=rcampbell@nvidia.com \
    --cc=shy828301@gmail.com \
    --cc=wangyugui@e16-tech.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).