All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Nadav Amit <namit@vmware.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Nadav Amit <nadav.amit@gmail.com>
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
Date: Tue, 12 Jul 2022 10:56:34 -0400	[thread overview]
Message-ID: <Ys2LoqBlCep6q/A9@xz-m1.local> (raw)
In-Reply-To: <5D85870C-CBDF-45F7-A3A5-5F889521BE41@vmware.com>

Hi, Nadav,

On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > From: Nadav Amit <namit@vmware.com>
> > 
> > Using a PTE on x86 with cleared access-bit (aka young-bit)
> > takes ~600 cycles more than when the access bit is set. At the same
> > time, setting the access-bit for memory that is not used (e.g.,
> > prefetched) can introduce greater overheads, as the prefetched memory is
> > reclaimed later than it should be.
> > 
> > Userfaultfd currently does not set the access-bit (excluding the
> > huge-pages case). Arguably, it is best to let the user control whether
> > the access bit should be set or not. The expected use is to request
> > userfaultfd to set the access-bit when the copy/wp operation is done to
> > resolve a page-fault, and not to set the access-bit when the memory is
> > prefetched.
> > 
> > Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> > young bit to be set.
> 
> I reply to my own email, but this mostly addresses the concerns that Peter
> has raised.
> 
> So I ran the test below on my Haswell (x86), which showed two things:
> 
> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
> more than with dirty+young (depending on the access, of course: dirty
> does not matter for read, dirty+young both matter for write).
> 
> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
> created as young with mk_pte(). So the logic should be similar to
> do_set_pte():
> 
>         if (prefault && arch_wants_old_prefaulted_pte())
>                 entry = pte_mkold(entry);
>         else
>                 entry = pte_sw_mkyoung(entry);
> 
> Based on these results, I will send another version for both young and
> dirty. Let me know if these results are not convincing.

Thanks for trying to verify this idea, but I'm not fully sure this is what
my concern was on WRITE_LIKELY.

AFAICT the test below was trying to measure the overhead of hardware
setting either access or dirty or both bits when they're not set for
read/write.

What I wanted as a justification is whether WRITE_LIKELY would be helpful
in any real world scenario at all.  AFAIK the only way to prove it so far
is to measure any tlb flush difference (probably only on x86, since that
tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
trigger with W=0,D=0 (where W stands for "write bit", and D stands for
"dirty bit").

It's not about the slowness when D is cleared.

The core thing is (sorry to rephrase, but just hope we're on the same page)
we'll set D bit always for all uffd pages so far.  Even if we want to
change that behavior so we skip setting D bit for RO pages (we'll need to
convert the dirty bit into PageDirty though), we'll still always set D bit
for writable pages.  So we always set D bit as long as possible and we'll
never suffer from hardware overhead on setting D bit for uffd pages.

The other worry of having WRITE_HINT is, after we have it we probably need
to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
very light ABI change since we used to always set it), then I'll start to
worry the hardware setting D bit overhead you just measured because we'll
have that overhead when user didn't specify WRITE_HINT with the old code.

So again, I'm totally fine if you want to start with ACCESS_HINT only, but
I still don't see why we should need WRITE_HINT too..

Thanks,

> 
> I will add, as we discussed (well, I think I raised these things, so
> hopefully you agree):
> 
> 1. On x86, avoid flush if changing WP->RO and PTE is clean.
> 
> 2. When write-unprotecting entry, if PTE is exclusive, set it as writable.
> [ I considered not setting it as writable if write-hint is not provided, but
> with the change in (1), it does not provide any real value. ]
> 
> ---
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <stdio.h>
> #include <linux/userfaultfd.h>
> #include <pthread.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <poll.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> #include <stdint.h>
> #include <stdbool.h>
> 
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
> 		       } while (0)
> 
> static inline uint64_t rdtscp(void)
> {
> 	uint64_t rax, rdx;
> 	uint32_t aux;
> 	asm volatile ("rdtscp" : "=a" (rax), "=d" (rdx), "=c" (aux):: "memory");
> }
> 
> int main(int argc, char *argv[])
> {
> 	long uffd;          /* userfaultfd file descriptor */
> 	char *addr;         /* Start of region handled by userfaultfd */
> 	unsigned long len;  /* Length of region handled by userfaultfd */
> 	pthread_t thr;      /* ID of thread that handles page faults */
> 	bool young, dirty, write;
> 	struct uffdio_api uffdio_api;
> 	struct uffdio_register uffdio_register;
> 	int l;
> 	static char *page = NULL;
> 	struct uffdio_copy uffdio_copy;
> 	ssize_t nread;
> 	int page_size;
> 
> 	if (argc != 5) {
> 		fprintf(stderr, "Usage: %s [num-pages] [write] [young] [dirty]\n", argv[0]);
> 		exit(EXIT_FAILURE);
> 	}
> 
> 	page_size = sysconf(_SC_PAGE_SIZE);
> 	len = strtoul(argv[1], NULL, 0) * page_size;
> 	write = !!strtoul(argv[2], NULL, 0);
> 	young = !!strtoul(argv[3], NULL, 0);
> 	dirty = !!strtoul(argv[4], NULL, 0);
> 
> 	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
> 		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 	if (page == MAP_FAILED)
> 		errExit("mmap");
> 
> 	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> 	if (uffd == -1)
> 		errExit("userfaultfd");
> 
> 	uffdio_api.api = UFFD_API;
> 	uffdio_api.features = (1<<11); //UFFD_FEATURE_EXACT_ADDRESS;
> 	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
> 		errExit("ioctl-UFFDIO_API");
> 
> 	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
> 		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (addr == MAP_FAILED)
> 		errExit("mmap");
> 
> 	uffdio_register.range.start = (unsigned long) addr;
> 	uffdio_register.range.len = len;
> 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
> 		errExit("ioctl-UFFDIO_REGISTER");
> 
> 	uffdio_copy.src = (unsigned long) page;
> 	uffdio_copy.mode = 0;
> 	if (young)
> 		uffdio_copy.mode |= (1ul << 2);
> 	if (dirty)
> 		uffdio_copy.mode |= (1ul << 3);
> 
> 	uffdio_copy.len = page_size;
> 	uffdio_copy.copy = 0;
> 
> 	for (l = 0; l < len; l += page_size) {
> 		uffdio_copy.dst = (unsigned long)(&addr[l]);
> 		if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
> 			errExit("ioctl-UFFDIO_COPY");
> 	}
> 
> 	for (l = 0; l < len; l += page_size) {
> 		char c;
> 		uint64_t start;
> 
> 		start = rdtscp();
> 		if (write)
> 			addr[l] = 5;
> 		else
> 			c = *(volatile char *)(&addr[l]);
> 		printf("%ld\n", rdtscp() - start);
> 	}
> 
> 	exit(EXIT_SUCCESS);
> }
> 

-- 
Peter Xu



  reply	other threads:[~2022-07-12 14:56 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-23 21:57   ` Peter Xu
2022-06-23 22:04     ` Nadav Amit
2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
2022-06-23 23:24   ` Peter Xu
2022-06-23 23:35     ` Nadav Amit
2022-06-23 23:49       ` Peter Xu
2022-06-24  0:03         ` Nadav Amit
2022-06-24  2:05           ` Peter Xu
2022-06-24  2:42             ` Nadav Amit
2022-06-24 21:58               ` Peter Xu
2022-06-24 22:17                 ` Peter Xu
2022-06-25  7:49                   ` Nadav Amit
2022-06-27 13:12                     ` Peter Xu
2022-06-27 13:27                       ` David Hildenbrand
2022-06-27 14:59                         ` Peter Xu
2022-06-27 23:37                       ` Nadav Amit
2022-06-28 10:55                         ` David Hildenbrand
2022-06-28 19:15                         ` Peter Xu
2022-06-28 20:30                           ` Nadav Amit
2022-06-28 20:56                             ` Peter Xu
2022-06-28 21:03                               ` Nadav Amit
2022-06-28 21:12                                 ` Peter Xu
2022-06-28 21:15                                   ` Nadav Amit
2022-07-12  6:19   ` Nadav Amit
2022-07-12 14:56     ` Peter Xu [this message]
2022-07-13  1:09       ` Nadav Amit
2022-07-13 16:02         ` Peter Xu
2022-07-13 16:49           ` Nadav Amit
2022-06-22 18:50 ` [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
2022-06-22 18:50 ` [PATCH v1 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-23 23:34   ` Peter Xu
2022-06-22 18:50 ` [PATCH v1 5/5] selftest/userfaultfd: test read/write hints Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ys2LoqBlCep6q/A9@xz-m1.local \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=nadav.amit@gmail.com \
    --cc=namit@vmware.com \
    --cc=rppt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.