All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, John Hubbard <jhubbard@nvidia.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast
Date: Wed, 31 Aug 2022 09:15:16 +0200	[thread overview]
Message-ID: <60bbe75b-8dd1-3c46-2f8f-c2407493ffb8@redhat.com> (raw)
In-Reply-To: <Yw5rwIUPm49XtqOB@nvidia.com>

On 30.08.22 21:57, Jason Gunthorpe wrote:
> On Tue, Aug 30, 2022 at 08:53:06PM +0200, David Hildenbrand wrote:
>> On 30.08.22 20:45, Jason Gunthorpe wrote:
>>> On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote:
>>>> ... and looking into the details of TLB flush and GUP-fast interaction
>>>> nowadays, that case is no longer relevant. A TLB flush is no longer
>>>> sufficient to stop concurrent GUP-fast ever since we introduced generic
>>>> RCU GUP-fast.
>>>
>>> Yes, we've had RCU GUP fast for a while, and it is more widely used
>>> now, IIRC.
>>>
>>> It has been a bit, but if I remember, GUP fast in RCU mode worked on a
>>> few principles:
>>>
>>>  - The PTE page must not be freed without RCU
>>>  - The PTE page content must be convertable to a struct page using the
>>>    usual rules (eg PTE Special)
>>>  - That struct page refcount may go from 0->1 inside the RCU
>>>  - In the case the refcount goes from 0->1 there must be sufficient
>>>    barriers such that GUP fast observing the refcount of 1 will also
>>>    observe the PTE entry has changed. ie before the refcount is
>>>    dropped in the zap it has to clear the PTE entry, the refcount
>>>    decr has to be a 'release' and the refcount incr in gup fast has be
>>>    to be an 'acquire'.
>>>  - The rest of the system must tolerate speculative refcount
>>>    increments from GUP on any random page
>>>> The basic idea being that if GUP fast obtains a valid reference on a
>>> page *and* the PTE entry has not changed then everything is fine.
>>>
>>> The tricks with TLB invalidation are just a "poor mans" RCU, and
>>> arguably these days aren't really needed since I think we could make
>>> everything use real RCU always without penalty if we really wanted.
>>>
>>> Today we can create a unique 'struct pagetable_page' as Matthew has
>>> been doing in other places that guarentees a rcu_head is always
>>> available for every page used in a page table. Using that we could
>>> drop the code in the TLB flusher that allocates memory for the
>>> rcu_head and hopes for the best. (Or even is the common struct page
>>> rcu_head already guarenteed to exist for pagetable pages now a days?)
>>>
>>> IMHO that is the main reason we still have the non-RCU mode at all..
>>
>>
>> Good, I managed to attract the attention of someone who understands that machinery :)
>>
>> While validating whether GUP-fast and PageAnonExclusive code work correctly,
>> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
>> improve PageAnonExclusive clearing (I think we're missing memory barriers to
>> make it work as expected in any possible case), but I also stumbled eventually
>> over a more generic issue that might need memory barriers.
>>
>> Any thoughts whether I am missing something or this is actually missing
>> memory barriers?
> 
> I don't like the use of smb_mb very much, I deliberately choose the
> more modern language of release/acquire because it makes it a lot
> clearer what barriers are doing..
> 
> So, if we dig into it, using what I said above, the atomic refcount is:
> 
> gup_pte_range()
>   try_grab_folio()
>    try_get_folio()
>     folio_ref_try_add_rcu()
>      folio_ref_add_unless()
>        page_ref_add_unless()
>         atomic_add_unless()

Right, that seems to work as expected for checking the refcount after
clearing the PTE.

Unfortunately, it's not sufficien to identify whether a page may be
pinned, because the flow continues as

folio = try_get_folio(page, refs)
...
if (folio_test_large(folio))
	atomic_add(refs, folio_pincount_ptr(folio));
else
	folio_ref_add(folio, ...)

So I guess we'd need a smb_mb() before re-checking the PTE for that case.

> 
> So that wants to be an acquire
> 
> The pairing release is in the page table code that does the put_page,
> it wants to be an atomic_dec_return() as a release.
> 
> Now, we go and look at Documentation/atomic_t.txt to try to understand
> what are the ordering semantics of the atomics we are using and become
> dazed-confused like me:

I read that 3 times and got dizzy. Thanks for double-checking, very much
appreciated!

> 
>  ORDERING  (go read memory-barriers.txt first)
>  --------
> 
>   - RMW operations that have a return value are fully ordered;
> 
>   - RMW operations that are conditional are unordered on FAILURE,
>     otherwise the above rules apply.
> 
>  Fully ordered primitives are ordered against everything prior and everything
>  subsequent. Therefore a fully ordered primitive is like having an smp_mb()
>  before and an smp_mb() after the primitive.
> 
> So, I take that to mean that both atomic_add_unless() and
> atomic_dec_return() are "fully ordered" and "fully ordered" is a super
> set of acquire/release.
> 
> Thus, we already have the necessary barriers integrated into the
> atomic being used.
> 
> The smb_mb_after_atomic stuff is to be used with atomics that don't
> return values, there are some examples in the doc

Yes, I missed the point that RMW operations that return a value are
fully ordered and imply smp_mb() before / after.

-- 
Thanks,

David / dhildenb


  parent reply	other threads:[~2022-08-31  7:19 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-25 16:46 [PATCH v1 0/3] mm: minor cleanups around NUMA hinting David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 1/3] mm/gup: replace FOLL_NUMA by gup_can_follow_protnone() David Hildenbrand
2022-08-25 16:46 ` [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in GUP-fast David Hildenbrand
2022-08-26 14:59   ` David Hildenbrand
2022-08-30 18:23     ` David Hildenbrand
2022-08-30 18:45       ` Jason Gunthorpe
2022-08-30 18:53         ` David Hildenbrand
2022-08-30 19:18           ` John Hubbard
2022-08-30 19:23             ` David Hildenbrand
2022-08-30 23:44               ` Jason Gunthorpe
2022-08-31  7:44                 ` David Hildenbrand
2022-08-31 16:21               ` Peter Xu
2022-08-31 16:31                 ` David Hildenbrand
2022-08-31 18:23                   ` Peter Xu
2022-08-31 19:25                     ` David Hildenbrand
2022-09-01  7:55                       ` Alistair Popple
2022-08-30 19:57           ` Jason Gunthorpe
2022-08-30 20:12             ` John Hubbard
2022-08-30 22:39               ` Jason Gunthorpe
2022-08-31  7:15             ` David Hildenbrand [this message]
2022-08-25 16:46 ` [PATCH v1 3/3] mm: fixup documentation regarding pte_numa() and PROT_NUMA David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=60bbe75b-8dd1-3c46-2f8f-c2407493ffb8@redhat.com \
    --to=david@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=peterx@redhat.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.