Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations

From: Nadav Amit <namit@vmware.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
Date: Mon, 18 Jul 2022 20:59:37 +0000
Message-ID: <8F18AE8D-2496-4F8C-90C2-D537E88F7137@vmware.com>
In-Reply-To: <YtW9Al4RXFWE9PoT@xz-m1.local>

On Jul 18, 2022, at 1:05 PM, Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
>> @@ -261,6 +272,7 @@ struct uffdio_copy {
>> struct uffdio_zeropage {
>>      struct uffdio_range range;
>> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE                ((__u64)1<<0)
>> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY   ((__u64)1<<1)
> 
> Would the access hint help the zeropage use case?  I remember you commented
> earlier that it won't help, since we won't reclaim the zero page anyway.

I agree that the access bit has no meaning for the zero page. I just think
it is best to have the flags for consistency. If you ask me, I would prefer
to have all the flags in a fixed place (highest bits?). Anyhow, if we expose
the hints as a feature, I do not think we would later want to say “here is
another feature that enables another hint that we thought was not needed
before”. Userfaultfd’s feature bits are already nuts, IMHO.

> It won't help either even if this flag is only used for the follow up
> WRITE_HINT (since then there'll be a CoW) because when WRITE_HINT attached
> it doesn't make sense to not have ACCESS_HINT, then it seems the WRITE_HINT
> itself would be enough for ZEROPAGE to me.

Agreed. Again, I think it is worth having for consistency.

> [...]
> 
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index 421784d26651..c15679f3eb6a 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>      bool writable = dst_vma->vm_flags & VM_WRITE;
>>      bool vm_shared = dst_vma->vm_flags & VM_SHARED;
>>      bool page_in_cache = page->mapping;
>> +     bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
> 
> I think it's okay to name it "prefault" as a temp var, but ideally IMHO we
> shouldn't assume what the user app is doing - it is only installing some
> uffd pgtables with !ACCESS_LIKELY and it does not necessarily need to be a
> prefault process..
> 
>>      spinlock_t *ptl;
>>      struct inode *inode;
>>      pgoff_t offset, max_off;
>> @@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>               */
>>              _dst_pte = pte_wrprotect(_dst_pte);
>> 
>> +     if (prefault && arch_wants_old_prefaulted_pte())
>> +             _dst_pte = pte_mkold(_dst_pte);
>> +     else
>> +             _dst_pte = pte_sw_mkyoung(_dst_pte);
> 
> Could you explain why we couldn't unconditionally mkold here even for x86?

To answer this question and the previous one, please note that the logic is
“borrowed” from do_set_pte(). If you want me to refactor and extract a
function, please let me know.

Here is the deal: for x86, we don’t do pte_mkold() because having the
hardware set the access bit on an old PTE is expensive (more than 500
cycles). For arm64 CPUs that have a hardware access flag we do, since
(according to the arm64 code or commit log) the cost of setting the access
bit on arm64 is low.
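
To make the young/old decision concrete, here is a minimal user-space
sketch of it. Everything prefixed toy_/TOY_ is a stand-in I made up for
illustration, not a kernel definition; arch_wants_old models the return value
of arch_wants_old_prefaulted_pte() (false on x86, true on arm64 with hardware
AF), and the else branch is simplified to "the PTE ends up young".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE holding only the bits discussed here; NOT the kernel's pte_t. */
struct toy_pte {
	bool young;	/* access bit */
	bool write;	/* writable */
};

/* Hypothetical value; the real UFFD_FLAGS_ACCESS_LIKELY comes from the patch. */
#define TOY_FLAGS_ACCESS_LIKELY	(1u << 0)

static struct toy_pte toy_pte_mkold(struct toy_pte pte)
{
	pte.young = false;
	return pte;
}

static struct toy_pte toy_pte_mkyoung(struct toy_pte pte)
{
	pte.young = true;
	return pte;
}

/*
 * Mirrors the shape of the hunk in mfill_atomic_install_pte():
 * the PTE is made old only when the caller did NOT pass the
 * access-likely hint AND the architecture prefers old prefaulted PTEs.
 */
static struct toy_pte toy_install_pte(struct toy_pte pte,
				      unsigned int uffd_flags,
				      bool arch_wants_old)
{
	bool prefault = !(uffd_flags & TOY_FLAGS_ACCESS_LIKELY);

	if (prefault && arch_wants_old)
		return toy_pte_mkold(pte);
	return toy_pte_mkyoung(pte);
}
```

With this shape, omitting the hint only produces an old PTE when
arch_wants_old is true; on an architecture where it is false, the hint has
no effect on the access bit.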

> It'll be a pity if this feature bit will only be useful on arm64 but not
> covering x86 (which is so far still the majority I think).
> 
> IMHO it's slightly different here compared to kernel prefaults - the user
> app may not be aware of kernel prefaults, but here !ACCESS_HINT is
> user-aware, and it's what the user app explicitly provided.  IMO it's a
> stronger proof of a cold page already.

I’m ok with that if that is your choice. I actually prefer to give userspace
more control, but I tried to be consistent with other parts of the kernel.
Having said that, it’s really hard for me to see why young bit would be clear,
but dirty bit would be set...

> The other thing that confused me here is that arch_wants_old_prefaulted_pte()
> returns true if arm64 supports hardware AF.  However for all the other archs
> (including x86_64 which, afaict, supports AF too in most models) it'll
> constantly return false.  Do you know what the rationale behind it is?

All x86 CPUs (32/64-bit) since the 386 support an access bit in the page
tables (IIRC, the 286 had an access bit in the segments).

I thought we discussed it before: if you access an old PTE on x86, you pay
more than 500 cycles; this actually affected UnixBench when people tried to
change this behavior [1]. In contrast, on arm64, which I have never profiled,
you probably saw the comment saying: "Experimentally, it's cheap to set the
access flag in hardware and we benefit from prefaulting mappings as 'old' to
start with."

I do not know what happens on other architectures.

(Sorry if I repeat myself in this email.)

[1] https://marc.info/?l=linux-kernel&m=146582237922378&w=2