linux-kernel.vger.kernel.org archive mirror
* Review request: draft userfaultfd(2) manual page
@ 2017-03-20 20:08 Michael Kerrisk (man-pages)
  2017-03-20 20:11 ` Review request: draft ioctl_userfaultfd(2) " Michael Kerrisk (man-pages)
  2017-03-21 14:01 ` Review request: draft userfaultfd(2) " Mike Rapoport
  0 siblings, 2 replies; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-03-20 20:08 UTC (permalink / raw)
  To: Andrea Arcangeli, Mike Rapoport; +Cc: mtk.manpages, lkml, linux-mm, linux-man

[-- Attachment #1: Type: text/plain, Size: 20603 bytes --]

Hello Andrea, Mike, and all,

Mike: thanks for the page that you sent. I've reworked it
a bit, and also added a lot of further information,
and an example program. In the process, I split the page
into two pieces, with one piece describing the userfaultfd()
system call and the other describing the ioctl() operations.

I'd like to get review input, especially from you and
Andrea, but also anyone else, for the current version
of this page, which includes a few FIXMEs to be sorted.

I've shown the rendered version of the page below. 
The groff source is attached, and can also be found
at the branch here:

https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd

The new ioctl_userfaultfd(2) page follows this mail.

Cheers,

Michael


USERFAULTFD(2)         Linux Programmer's Manual        USERFAULTFD(2)

┌─────────────────────────────────────────────────────┐
│FIXME                                                │
├─────────────────────────────────────────────────────┤
│Need to  describe close(2) semantics for userfaultfd │
│file descriptor: what happens when  the  userfaultfd │
│FD is closed?                                        │
│                                                     │
└─────────────────────────────────────────────────────┘

NAME
       userfaultfd - create a file descriptor for handling page faults
       in user space

SYNOPSIS
       #include <sys/types.h>
       #include <linux/userfaultfd.h>

       int userfaultfd(int flags);

       Note: There is no glibc  wrapper  for  this  system  call;  see
       NOTES.

DESCRIPTION
       userfaultfd() creates a new userfaultfd object that can be used
       for delegation of page-fault handling to a user-space  applica‐
       tion,  and  returns  a  file  descriptor that refers to the new
       object.   The  new  userfaultfd  object  is  configured   using
       ioctl(2).

       Once  the userfaultfd object is configured, the application can
       use read(2) to receive userfaultfd  notifications.   The  reads
       from  userfaultfd may be blocking or non-blocking, depending on
       the value of flags used for the creation of the userfaultfd  or
       subsequent calls to fcntl(2).
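
       As a minimal sketch of the fcntl(2) route mentioned above, the
       helper below switches an already-open file descriptor to
       non-blocking mode.  The name set_nonblocking() is hypothetical
       and not part of any userfaultfd API; the code uses only the
       standard F_GETFL and F_SETFL operations.

```c
#include <fcntl.h>

/* Add O_NONBLOCK to the file status flags of 'fd', preserving the
   other flags. Returns 0 on success, or -1 on error (errno is set
   by fcntl(2)). */
int
set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl == -1)
        return -1;

    return fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1 ? -1 : 0;
}
```

       A descriptor created without O_NONBLOCK could be passed to
       this helper before the read loop starts.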

       The following values may be bitwise ORed in flags to change the
       behavior of userfaultfd():

       O_CLOEXEC
              Enable the close-on-exec flag for  the  new  userfaultfd
              file  descriptor.   See the description of the O_CLOEXEC
              flag in open(2).

       O_NONBLOCK
              Enable  non-blocking  operation  for   the   userfaultfd
              object.   See  the description of the O_NONBLOCK flag in
              open(2).

   Usage
       The userfaultfd mechanism is designed to allow a  thread  in  a
       multithreaded  program  to  perform  user-space  paging for the
       other threads in the process.  When a page fault occurs for one
       of the regions registered to the userfaultfd object, the fault‐
       ing thread is put to sleep and an event is generated  that  can
       be  read  via  the userfaultfd file descriptor.  The fault-han‐
       dling thread reads events from this file  descriptor  and  ser‐
       vices  them  using  the  operations  described  in  ioctl_user‐
       faultfd(2).  When servicing the page fault events,  the  fault-
       handling thread can trigger a wake-up for the sleeping thread.

   Userfaultfd operation
       After the userfaultfd object is created with userfaultfd(), the
       application must enable it using the UFFDIO_API ioctl(2) opera‐
       tion.  This operation allows a handshake between the kernel and
       user space to determine the API version and supported features.
       This  operation  must  be  performed  before  any  of the other
       ioctl(2) operations described below (or those  operations  fail
       with the EINVAL error).

       After  a  successful UFFDIO_API operation, the application then
       registers  memory  address  ranges  using  the  UFFDIO_REGISTER
       ioctl(2)  operation.   After  successful  completion  of a UFF‐
       DIO_REGISTER operation, a page fault occurring in the requested
       memory  range, and satisfying the mode defined at the registra‐
       tion time, will be forwarded by the kernel  to  the  user-space
       application.   The  application can then use the UFFDIO_COPY or
       UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page fault.

       Details of the various ioctl(2)  operations  can  be  found  in
       ioctl_userfaultfd(2).

       Currently,  userfaultfd can be used only with anonymous private
       memory mappings.

   Reading from the userfaultfd structure
       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │are the details below correct?                       │
       └─────────────────────────────────────────────────────┘
       Each read(2) from the userfaultfd file descriptor  returns  one
       or  more  uffd_msg  structures, each of which describes a page-
       fault event:

           struct uffd_msg {
               __u8  event;                /* Type of event */
               ...
               union {
                   struct {
                       __u64 flags;        /* Flags describing fault */
                       __u64 address;      /* Faulting address */
                   } pagefault;
                   ...
               } arg;

               /* Padding fields omitted */
           } __packed;

       If multiple events are available and  the  supplied  buffer  is
       large enough, read(2) returns as many events as will fit in the
       supplied buffer.  If the buffer supplied to read(2) is  smaller
       than the size of the uffd_msg structure, the read(2) fails with
       the error EINVAL.
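
       The batching behavior described above might be used as in the
       following sketch, which reads into an array of uffd_msg
       structures and converts the byte count returned by read(2)
       into a count of events.  The function name read_uffd_events()
       is an assumption made for illustration.

```c
#include <linux/userfaultfd.h>
#include <unistd.h>

/* Read up to 'max' events from 'uffd' into 'msgs'. Returns the
   number of complete events read, or -1 on error (errno is set by
   read(2)). */
ssize_t
read_uffd_events(int uffd, struct uffd_msg *msgs, size_t max)
{
    ssize_t nread = read(uffd, msgs, max * sizeof(struct uffd_msg));

    if (nread == -1)
        return -1;

    return nread / (ssize_t) sizeof(struct uffd_msg);
}
```

       On a non-blocking descriptor, a caller would typically treat
       an EAGAIN error from this function as "no events pending"
       rather than as a failure.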

       The fields set in the uffd_msg structure are as follows:

       event  The type of event.  Currently, only one value can appear
              in  this  field: UFFD_EVENT_PAGEFAULT, which indicates a
              page-fault event.

       address
              The address that triggered the page fault.

       flags  A bit mask  of  flags  that  describe  the  event.   For
              UFFD_EVENT_PAGEFAULT, the following flag may appear:

              UFFD_PAGEFAULT_FLAG_WRITE
                     If  the address is in a range that was registered
                     with the UFFDIO_REGISTER_MODE_MISSING  flag  (see
                     ioctl_userfaultfd(2))  and this flag is set, this
                     is a write fault; otherwise it is a read fault.
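
       A handler that needs to distinguish read faults from write
       faults could test the flag as sketched below.  The helper name
       is_write_fault() is hypothetical; the structure fields and
       constants come from <linux/userfaultfd.h>.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>

/* Return true if 'msg' describes a page-fault event caused by a
   write access, false for a read fault or any other event type. */
bool
is_write_fault(const struct uffd_msg *msg)
{
    return msg->event == UFFD_EVENT_PAGEFAULT &&
           (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0;
}
```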

       A read(2) on a userfaultfd file descriptor can  fail  with  the
       following errors:

       EINVAL The  userfaultfd  object  has not yet been enabled using
              the UFFDIO_API ioctl(2) operation.

       The userfaultfd file descriptor can be monitored with  poll(2),
       select(2),  and  epoll(7).  When events are available, the file
       descriptor indicates as readable.
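
       The following sketch shows one way to wait for readability
       with poll(2) using a timeout.  The helper wait_readable() is
       an illustrative name, and it works on any pollable file
       descriptor, not only a userfaultfd one.

```c
#include <poll.h>

/* Wait up to 'timeout_ms' milliseconds (-1 means wait indefinitely)
   for 'fd' to become readable. Returns 1 if POLLIN was reported,
   0 if it was not (timeout, or only an error condition was
   reported), or -1 if poll(2) itself failed. */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int nready = poll(&pfd, 1, timeout_ms);

    if (nready == -1)
        return -1;

    return (nready == 1 && (pfd.revents & POLLIN) != 0) ? 1 : 0;
}
```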


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │But, it seems,  the  object  must  be  created  with │
       │O_NONBLOCK.  What is the rationale for this require‐ │
       │ment? Something needs to  be  said  in  this  manual │
       │page.                                                │
       └─────────────────────────────────────────────────────┘

RETURN VALUE
       On  success,  userfaultfd()  returns a new file descriptor that
       refers to the userfaultfd object.  On error,  -1  is  returned,
       and errno is set appropriately.

ERRORS
       EINVAL An unsupported value was specified in flags.

       EMFILE The  per-process  limit  on  the  number  of  open  file
              descriptors has been reached.

       ENFILE The system-wide limit on the total number of open  files
              has been reached.

       ENOMEM Insufficient kernel memory was available.

VERSIONS
       The userfaultfd() system call first appeared in Linux 4.3.

CONFORMING TO
       userfaultfd()  is Linux-specific and should not be used in pro‐
       grams intended to be portable.

NOTES
       Glibc does not provide a wrapper for this system call; call  it
       using syscall(2).
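
       Because there is no glibc wrapper, a program might define a
       small one of its own, as in this sketch.  The name uffd_open()
       is an assumption; the system call number __NR_userfaultfd is
       taken from <sys/syscall.h>, as in the example program below.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. Returns a new file
   descriptor on success, or -1 on error (errno is set by
   syscall(2)). */
int
uffd_open(int flags)
{
    return (int) syscall(__NR_userfaultfd, flags);
}
```

       A caller would typically invoke this as uffd_open(O_CLOEXEC |
       O_NONBLOCK) after including <fcntl.h>, and check for a return
       value of -1.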

       The userfaultfd mechanism can be used as an alternative to tra‐
       ditional user-space paging techniques based on the use  of  the
       SIGSEGV  signal  and mmap(2).  It can also be used to implement
       lazy restore for  checkpoint/restore  mechanisms,  as  well  as
       post-copy  migration  to allow (nearly) uninterrupted execution
       when transferring virtual machines from one host to another.

EXAMPLE
       The program below demonstrates the use of the userfaultfd mech‐
       anism.   The  program creates two threads, one of which acts as
       the page-fault handler for the process,  for  the  pages  in  a
       demand-zero paged region created using mmap(2).

       The  program takes one command-line argument, which is the num‐
       ber of pages that will be  created  in  a  mapping  whose  page
       faults will be handled via userfaultfd.  After creating a user‐
       faultfd object, the program then creates an  anonymous  private
       mapping  of  the specified size and registers the address range
       of that mapping using the UFFDIO_REGISTER  ioctl(2)  operation.
       The  program then creates a second thread that will perform the
       task of handling page faults.

       The main thread then walks through the  pages  of  the  mapping
       fetching  bytes  from successive pages.  Because the pages have
       not yet been accessed, the first access of a byte in each  page
       will  trigger  a  page-fault  event  on  the  userfaultfd  file
       descriptor.

       Each of the page-fault events is handled by the second  thread,
       which sits in a loop processing input from the userfaultfd file
       descriptor.  In each loop iteration, the  second  thread  first
       calls  poll(2)  to  check the state of the file descriptor, and
       then reads an event from the file descriptor.  All such  events
       should be UFFD_EVENT_PAGEFAULT events, which the thread handles
       by copying a page of data into the faulting  region  using  the
       UFFDIO_COPY ioctl(2) operation.

       The  following  is  an  example of what we see when running the
       program:

           $ ./userfaultfd_demo 3
           Address returned by mmap() = 0x7fd30106c000

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106c00f in main(): A
           Read address 0x7fd30106c40f in main(): A
           Read address 0x7fd30106c80f in main(): A
           Read address 0x7fd30106cc0f in main(): A

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106d00f in main(): B
           Read address 0x7fd30106d40f in main(): B
           Read address 0x7fd30106d80f in main(): B
           Read address 0x7fd30106dc0f in main(): B

           fault_handler_thread():
               poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
               UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
                   (uffdio_copy.copy returned 4096)
           Read address 0x7fd30106e00f in main(): C
           Read address 0x7fd30106e40f in main(): C
           Read address 0x7fd30106e80f in main(): C
           Read address 0x7fd30106ec0f in main(): C

   Program source

       /* userfaultfd_demo.c

          Licensed under the GNU General Public License version 2 or later.
       */
       #define _GNU_SOURCE
       #include <sys/types.h>
       #include <stdio.h>
       #include <linux/userfaultfd.h>
       #include <pthread.h>
       #include <errno.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <fcntl.h>
       #include <signal.h>
       #include <poll.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       static int page_size;

       static void *
       fault_handler_thread(void *arg)
       {
           static struct uffd_msg msg;   /* Data read from userfaultfd */
           static int fault_cnt = 0;     /* Number of faults so far handled */
           long uffd;                    /* userfaultfd file descriptor */
           static char *page = NULL;
           struct uffdio_copy uffdio_copy;
           ssize_t nread;

           uffd = (long) arg;

           /* Create a page that will be copied into the faulting region */

           if (page == NULL) {
               page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
               if (page == MAP_FAILED)
                   errExit("mmap");
           }

           /* Loop, handling incoming events on the userfaultfd
              file descriptor */

           for (;;) {

               /* See what poll() tells us about the userfaultfd */

               struct pollfd pollfd;
               int nready;
               pollfd.fd = uffd;
               pollfd.events = POLLIN;
               nready = poll(&pollfd, 1, -1);
               if (nready == -1)
                   errExit("poll");

               printf("\nfault_handler_thread():\n");
               printf("    poll() returns: nready = %d; "
                       "POLLIN = %d; POLLERR = %d\n", nready,
                       (pollfd.revents & POLLIN) != 0,
                       (pollfd.revents & POLLERR) != 0);

               /* Read an event from the userfaultfd */

               nread = read(uffd, &msg, sizeof(msg));
               if (nread == 0) {
                   printf("EOF on userfaultfd!\n");
                   exit(EXIT_FAILURE);
               }

               if (nread == -1)
                   errExit("read");

               /* We expect only one kind of event; verify that assumption */

               if (msg.event != UFFD_EVENT_PAGEFAULT) {
                   fprintf(stderr, "Unexpected event on userfaultfd\n");
                   exit(EXIT_FAILURE);
               }

               /* Display info about the page-fault event */

               printf("    UFFD_EVENT_PAGEFAULT event: ");
               printf("flags = %llx; ", msg.arg.pagefault.flags);
               printf("address = %llx\n", msg.arg.pagefault.address);

               /* Copy the page pointed to by 'page' into the faulting
                  region. Vary the contents that are copied in, so that it
                  is more obvious that each fault is handled separately. */

               memset(page, 'A' + fault_cnt % 20, page_size);
               fault_cnt++;

               uffdio_copy.src = (unsigned long) page;

               /* We need to handle page faults in units of pages(!).
                  So, round faulting address down to page boundary */

               uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                                  ~(page_size - 1);
               uffdio_copy.len = page_size;
               uffdio_copy.mode = 0;
               uffdio_copy.copy = 0;
               if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
                   errExit("ioctl-UFFDIO_COPY");

               printf("        (uffdio_copy.copy returned %lld)\n",
                       uffdio_copy.copy);
           }
       }

       int
       main(int argc, char *argv[])
       {
           long uffd;          /* userfaultfd file descriptor */
           char *addr;         /* Start of region handled by userfaultfd */
           unsigned long len;  /* Length of region handled by userfaultfd */
           pthread_t thr;      /* ID of thread that handles page faults */
           struct uffdio_api uffdio_api;
           struct uffdio_register uffdio_register;
           int s;

           if (argc != 2) {
               fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           page_size = sysconf(_SC_PAGE_SIZE);
           len = strtoul(argv[1], NULL, 0) * page_size;

           /* Create and enable userfaultfd object */

           uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
           if (uffd == -1)
               errExit("userfaultfd");

           uffdio_api.api = UFFD_API;
           uffdio_api.features = 0;
           if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
               errExit("ioctl-UFFDIO_API");

           /* Create a private anonymous mapping. The memory will be
              demand-zero paged--that is, not yet allocated. When we
              actually touch the memory, it will be allocated via
              the userfaultfd. */

           addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (addr == MAP_FAILED)
               errExit("mmap");

           printf("Address returned by mmap() = %p\n", addr);

           /* Register the memory range of the mapping we just created for
              handling by the userfaultfd object. In mode, we request to track
              missing pages (i.e., pages that have not yet been faulted in). */

           uffdio_register.range.start = (unsigned long) addr;
           uffdio_register.range.len = len;
           uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
           if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
               errExit("ioctl-UFFDIO_REGISTER");

           /* Create a thread that will process the userfaultfd events */

           s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
           if (s != 0) {
               errno = s;
               errExit("pthread_create");
           }

           /* Main thread now touches memory in the mapping, touching
              locations 1024 bytes apart. This will trigger userfaultfd
              events for all pages in the region. */

           int l;
           l = 0xf;    /* Ensure that faulting address is not on a page
                          boundary, in order to test that we correctly
                          handle that case in fault_handler_thread() */
           while (l < len) {
               char c = addr[l];
               printf("Read address %p in main(): ", addr + l);
               printf("%c\n", c);
               l += 1024;
               usleep(100000);         /* Slow things down a little */
           }

           exit(EXIT_SUCCESS);
       }

SEE ALSO
       fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)

       Documentation/vm/userfaultfd.txt in  the  Linux  kernel  source
       tree


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

[-- Attachment #2: userfaultfd.2 --]
[-- Type: application/x-troff-man, Size: 16693 bytes --]


* Review request: draft ioctl_userfaultfd(2) manual page
  2017-03-20 20:08 Review request: draft userfaultfd(2) manual page Michael Kerrisk (man-pages)
@ 2017-03-20 20:11 ` Michael Kerrisk (man-pages)
  2017-03-22 13:54   ` Mike Rapoport
  2017-03-21 14:01 ` Review request: draft userfaultfd(2) " Mike Rapoport
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-03-20 20:11 UTC (permalink / raw)
  To: Andrea Arcangeli, Mike Rapoport; +Cc: mtk.manpages, lkml, linux-mm, linux-man

[-- Attachment #1: Type: text/plain, Size: 20555 bytes --]

Hello Andrea, Mike, and all,

Mike: here's the split out page that describes the 
userfaultfd ioctl() operations.

I'd like to get review input, especially from you and
Andrea, but also anyone else, for the current version
of this page, which includes quite a few FIXMEs to be
sorted.

I've shown the rendered version of the page below. 
The groff source is attached, and can also be found
at the branch here:

https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd

The new ioctl_userfaultfd(2) page follows this mail.

Cheers,

Michael

NAME
       ioctl_userfaultfd - create a file descriptor for handling page faults
       in user space

SYNOPSIS
       #include <sys/ioctl.h>

       int ioctl(int fd, int cmd, ...);

DESCRIPTION
       Various ioctl(2) operations can be performed on  a  userfaultfd  object
       (created by a call to userfaultfd(2)) using calls of the form:

           ioctl(fd, cmd, argp);

       In  the  above,  fd  is  a  file  descriptor referring to a userfaultfd
       object, cmd is one of the commands listed below, and argp is a  pointer
       to a data structure that is specific to cmd.

       The  various  ioctl(2) operations are described below.  The UFFDIO_API,
       UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
       userfaultfd behavior.  These operations allow the caller to choose what
       features will be enabled and what kinds of events will be delivered  to
       the application.  The remaining operations are range operations.  These
       operations enable the calling application to resolve page-fault  events
       in a consistent way.


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Above: What does "consistent" mean?                  │
       │                                                     │
       └─────────────────────────────────────────────────────┘

   UFFDIO_API
       (Since Linux 4.3.)  Enable operation of the userfaultfd and perform API
       handshake.  The argp argument is a pointer to a  uffdio_api  structure,
       defined as:

           struct uffdio_api {
               __u64 api;        /* Requested API version (input) */
               __u64 features;   /* Must be zero */
               __u64 ioctls;     /* Available ioctl() operations (output) */
           };

       The  api  field  denotes  the API version requested by the application.
       Before the call, the features field must be initialized to zero.


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Above: Why must the 'features' field be  initialized │
       │to zero?                                             │
       └─────────────────────────────────────────────────────┘

       The  kernel verifies that it can support the requested API version, and
       sets the features and ioctls fields to bit masks representing  all  the
       available features and the generic ioctl(2) operations available.  Cur‐
       rently, zero (i.e., no feature bits) is placed in the  features  field.
       The returned ioctls field can contain the following bits:


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │This  user-space  API  seems not fully polished. Why │
       │are there not constants defined for each of the bit- │
       │mask values listed below?                            │
       └─────────────────────────────────────────────────────┘

       1 << _UFFDIO_API
              The UFFDIO_API operation is supported.

       1 << _UFFDIO_REGISTER
              The UFFDIO_REGISTER operation is supported.

       1 << _UFFDIO_UNREGISTER
              The UFFDIO_UNREGISTER operation is supported.
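
       Since the bits are expressed as shifts of the _UFFDIO_*
       request numbers, a test of the returned ioctls field might
       look like the sketch below.  The helper uffd_has_ioctl() is a
       hypothetical name; the _UFFDIO_* constants are defined in
       <linux/userfaultfd.h>.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>

/* Return true if the bit for request number 'nr' (for example,
   _UFFDIO_REGISTER) is set in the 'ioctls' bit mask returned by
   the kernel. */
bool
uffd_has_ioctl(__u64 ioctls, unsigned int nr)
{
    return (ioctls & ((__u64) 1 << nr)) != 0;
}
```

       For example, after a successful UFFDIO_API operation, a
       program could check uffd_has_ioctl(uffdio_api.ioctls,
       _UFFDIO_REGISTER) before attempting to register a range.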


              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │Is  the above description of the 'ioctls' field cor‐ │
              │rect?  Does more need to be said?                    │
              │                                                     │
              └─────────────────────────────────────────────────────┘

       This ioctl(2) operation returns 0 on success.  On error, -1 is returned
       and  errno  is set to indicate the cause of the error.  Possible errors
       include:


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Is the following error list correct?                 │
       │                                                     │
       └─────────────────────────────────────────────────────┘

       EINVAL The userfaultfd has already been  enabled  by  a  previous  UFF‐
              DIO_API operation.

       EINVAL The  API  version requested in the api field is not supported by
              this kernel, or the features field was not zero.


              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │In the above error case, the  returned  'uffdio_api' │
              │structure is zeroed out. Why is this done? This      │
              │should be explained in the manual page.              │
              │                                                     │
              └─────────────────────────────────────────────────────┘

   UFFDIO_REGISTER
       (Since Linux 4.3.)  Register a memory  address  range  with  the  user‐
       faultfd  object.   The  argp argument is a pointer to a uffdio_register
       structure, defined as:

           struct uffdio_range {
               __u64 start;    /* Start of range */
               __u64 len;      /* Length of range (bytes) */
           };

           struct uffdio_register {
               struct uffdio_range range;
               __u64 mode;     /* Desired mode of operation (input) */
               __u64 ioctls;   /* Available ioctl() operations (output) */
           };


       The range field defines a memory range starting at start and continuing
       for len bytes that should be handled by the userfaultfd.

       The  mode  field  defines the mode of operation desired for this memory
       region.  The following values may be bitwise  ORed  to  set  the  user‐
       faultfd mode for the specified range:

       UFFDIO_REGISTER_MODE_MISSING
              Track page faults on missing pages.

       UFFDIO_REGISTER_MODE_WP
              Track page faults on write-protected pages.

       Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.

       If the operation is successful, the kernel modifies the ioctls bit-mask
       field to indicate which ioctl(2) operations are available for the spec‐
       ified range.  This returned bit mask is as for UFFDIO_API.

       This ioctl(2) operation returns 0 on success.  On error, -1 is returned
       and errno is set to indicate the cause of the error.   Possible  errors
       include:


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Is the following error list correct?                 │
       │                                                     │
       └─────────────────────────────────────────────────────┘

       EBUSY  A  mapping  in  the  specified  range is registered with another
              userfaultfd object.

       EINVAL An invalid or unsupported bit was specified in the  mode  field;
              or the mode field was zero.

       EINVAL There is no mapping in the specified address range.

       EINVAL range.start  or  range.len  is not a multiple of the system page
              size; or, range.len is  zero;  or  these  fields  are  otherwise
              invalid.

       EINVAL There was an incompatible mapping in the specified address range.


              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │Above: What does "incompatible" mean?                │
              │                                                     │
              └─────────────────────────────────────────────────────┘

   UFFDIO_UNREGISTER
       (Since Linux 4.3.)  Unregister a memory address range from userfaultfd.
       The address range to unregister is specified in the uffdio_range struc‐
       ture pointed to by argp.

       This ioctl(2) operation returns 0 on success.  On error, -1 is returned
       and errno is set to indicate the cause of the error.   Possible  errors
       include:

       EINVAL Either  the start or the len field of the uffdio_range structure
              was not a multiple of the system page size; or the len field was
              zero; or these fields were otherwise invalid.

       EINVAL There was an incompatible mapping in the specified address range.


              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │Above: What does "incompatible" mean?                │
              └─────────────────────────────────────────────────────┘

       EINVAL There was no mapping in the specified address range.

   UFFDIO_COPY
       (Since  Linux 4.3.)  Atomically copy a continuous memory chunk into the
       userfault registered range and optionally wake up the  blocked  thread.
       The  source  and  destination addresses and the number of bytes to copy
       are specified by the src, dst, and len fields of the uffdio_copy struc‐
       ture pointed to by argp:

           struct uffdio_copy {
               __u64 dst;    /* Destination of copy */
               __u64 src;    /* Source of copy */
               __u64 len;    /* Number of bytes to copy */
               __u64 mode;   /* Flags controlling behavior of copy */
               __s64 copy;   /* Number of bytes copied, or negated error */
           };

       The  following value may be bitwise ORed in mode to change the behavior
       of the UFFDIO_COPY operation:

       UFFDIO_COPY_MODE_DONTWAKE
              Do not wake up the thread that waits for page-fault resolution.

       The copy field is used by the kernel to return the number of bytes that
       was actually copied, or an error (a negated errno-style value).


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Above:  Why is the 'copy' field used to return error │
       │values?  This should  be  explained  in  the  manual │
       │page.                                                │
       └─────────────────────────────────────────────────────┘
       If  the  value returned in copy doesn't match the value that was speci‐
       fied in len, the operation fails with the error EAGAIN.  The copy field
       is output-only; it is not read by the UFFDIO_COPY operation.

       This ioctl(2) operation returns 0 on success.  In this case, the entire
       area was copied.  On error, -1 is returned and errno is set to indicate
       the cause of the error.  Possible errors include:

       EAGAIN The number of bytes copied (i.e., the value returned in the copy
              field) does not equal the value that was specified  in  the  len
              field.

       EINVAL Either dst or len was not a multiple of the system page size, or
              the range specified by src and len or dst and len was invalid.

       EINVAL An invalid bit was specified in the mode field.
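
       The EAGAIN case above suggests a retry loop in the fault handler.
       The following is only a sketch: advance_partial_copy() and
       copy_into_range() are hypothetical helper names, and the sketch
       assumes that, when ioctl(2) fails with EAGAIN, a positive value in
       copy gives the number of bytes at the start of the range that were
       already copied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: after UFFDIO_COPY failed with EAGAIN, skip the
   bytes that were already copied.  Returns 0 if the request was
   advanced and can be retried, -1 if 'copy' does not describe a
   partial copy (for example, it holds a negated error). */
static int
advance_partial_copy(struct uffdio_copy *c)
{
    assert(c != NULL);

    if (c->copy <= 0 || (uint64_t) c->copy >= c->len)
        return -1;

    uint64_t done = (uint64_t) c->copy;

    c->dst += done;
    c->src += done;
    c->len -= done;
    c->copy = 0;
    return 0;
}

/* Hypothetical helper: copy 'len' bytes from 'src' into the registered
   range at 'dst', retrying after partial copies.  Returns 0 on
   success, -1 on error (errno is left as set by ioctl()). */
static int
copy_into_range(int uffd, void *dst, const void *src, uint64_t len)
{
    struct uffdio_copy c = {
        .dst = (uintptr_t) dst,
        .src = (uintptr_t) src,
        .len = len,
        .mode = 0,
    };

    for (;;) {
        if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
            return 0;               /* Entire area was copied */
        if (errno != EAGAIN || advance_partial_copy(&c) == -1)
            return -1;              /* Not a partial copy; give up */
    }
}
```

       In a fault handler, dst would be the faulting address rounded down
       to a page boundary and len a multiple of the page size.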

   UFFDIO_ZEROPAGE
       (Since Linux 4.3.)  Zero out  a  memory  range  registered  with  user‐
       faultfd.   The  requested  range is specified by the range field of the
       uffdio_zeropage structure pointed to by argp:

           struct uffdio_zeropage {
               struct uffdio_range range;
               __u64 mode;     /* Flags controlling behavior of zeroing */
               __s64 zeropage; /* Number of bytes zeroed, or negated error */
           };

       The following value may be bitwise ORed in mode to change the  behavior
       of the UFFDIO_ZEROPAGE operation:

       UFFDIO_ZEROPAGE_MODE_DONTWAKE
              Do not wake up the thread that waits for page-fault resolution.

       The  zeropage field is used by the kernel to return the number of bytes
       that was actually zeroed, or an  error  in  the  same  manner  as  UFF‐
       DIO_COPY.


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Why  is  the  'zeropage'  field used to return error │
       │values?  This should  be  explained  in  the  manual │
       │page.                                                │
       └─────────────────────────────────────────────────────┘
       If  the  value  returned  in the zeropage field doesn't match the value
       that was specified in range.len, the operation  fails  with  the  error
       EAGAIN.   The zeropage field is output-only; it is not read by the
       UFFDIO_ZEROPAGE operation.

       This ioctl(2) operation returns 0 on success.  In this case, the entire
       area was zeroed.  On error, -1 is returned and errno is set to indicate
       the cause of the error.  Possible errors include:

       EAGAIN The number of bytes zeroed (i.e.,  the  value  returned  in  the
              zeropage  field)  does not equal the value that was specified in
              the range.len field.

       EINVAL Either range.start or range.len was not a multiple of the system
              page  size;  or  range.len  was zero; or the range specified was
              invalid.

       EINVAL An invalid bit was specified in the mode field.

   UFFDIO_WAKE
       (Since Linux 4.3.)  Wake up the thread waiting for  page-fault  resolu‐
       tion  on  a  specified  memory  address  range.  The argp argument is a
       pointer to a uffdio_range structure (shown above)  that  specifies  the
       address range.


       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Need more detail here. How is the UFFDIO_WAKE opera‐ │
       │tion used?                                           │
       └─────────────────────────────────────────────────────┘

       This ioctl(2) operation returns 0 on success.  On error, -1 is returned
       and  errno  is set to indicate the cause of the error.  Possible errors
       include:

       EINVAL The start or the len field of the uffdio_range structure was not
              a  multiple  of  the  system  page size; or len was zero; or the
              specified range was otherwise invalid.
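
       One use of UFFDIO_WAKE is to pair it with the DONTWAKE mode bits
       described above: resolve several pages without waking anyone, and
       then wake the waiters once.  The following sketch shows that
       pattern, assuming a hypothetical helper named zero_then_wake() and
       a range that is registered, page-aligned, and not yet populated:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: zero 'npages' pages starting at 'start', one
   page at a time and without waking any waiter, then wake all waiters
   on the whole range with a single UFFDIO_WAKE.  Returns 0 on success,
   -1 on the first failing ioctl() (errno is left as set by it). */
static int
zero_then_wake(int uffd, uint64_t start, uint64_t npages,
               uint64_t page_size)
{
    assert(page_size != 0);

    struct uffdio_range whole = {
        .start = start,
        .len = npages * page_size,
    };

    for (uint64_t i = 0; i < npages; i++) {
        struct uffdio_zeropage z = {
            .range = { .start = start + i * page_size,
                       .len = page_size },
            .mode = UFFDIO_ZEROPAGE_MODE_DONTWAKE,
        };

        if (ioctl(uffd, UFFDIO_ZEROPAGE, &z) == -1)
            return -1;
    }

    /* Threads blocked on faults in the range resume only now */
    return ioctl(uffd, UFFDIO_WAKE, &whole);
}
```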

RETURN VALUE
       See descriptions of the individual operations, above.

ERRORS
       See descriptions of the individual operations, above.  In addition, the
       following  general errors can occur for all of the operations described
       above:

       EFAULT argp does not point to a valid memory address.

       EINVAL (For all operations except UFFDIO_API.)  The userfaultfd  object
              has not yet been enabled (via the UFFDIO_API operation).

CONFORMING TO
       These ioctl(2) operations are Linux-specific.

EXAMPLE
       See userfaultfd(2).

SEE ALSO
       ioctl(2), mmap(2), userfaultfd(2)

       Documentation/vm/userfaultfd.txt in the Linux kernel source tree


[-- Attachment #2: ioctl_userfaultfd.2 --]
[-- Type: application/x-troff-man, Size: 12202 bytes --]


* Re: Review request: draft userfaultfd(2) manual page
  2017-03-20 20:08 Review request: draft userfaultfd(2) manual page Michael Kerrisk (man-pages)
  2017-03-20 20:11 ` Review request: draft ioctl_userfaultfd(2) " Michael Kerrisk (man-pages)
@ 2017-03-21 14:01 ` Mike Rapoport
  2017-04-21  6:30   ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2017-03-21 14:01 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: Andrea Arcangeli, lkml, linux-mm, linux-man

Hello Michael,

On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Andrea, Mike, and all,
> 
> Mike: thanks for the page that you sent. I've reworked it
> a bit, and also added a lot of further information,
> and an example program. In the process, I split the page
> into two pieces, with one piece describing the userfaultfd()
> system call and the other describing the ioctl() operations.
> 
> I'd like to get review input, especially from you and
> Andrea, but also anyone else, for the current version
> of this page, which includes a few FIXMEs to be sorted.

Thanks for the update. I'm addressing the FIXME points you've mentioned
below.
Otherwise, everything seems to be a correct description of the current
upstream. 4.11 will have quite a few updates to userfaultfd and we'll need
to update this page and ioctl_userfaultfd(2) to address those updates. I am
planning to work on the man page update in the next few weeks.
 
> I've shown the rendered version of the page below. 
> The groff source is attached, and can also be found
> at the branch here:
 
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> 
> The new ioctl_userfaultfd(2) page follows this mail.
> 
> Cheers,
> 
> Michael
 
--
Sincerely yours,
Mike. 
 

> USERFAULTFD(2)         Linux Programmer's Manual        USERFAULTFD(2)
> 
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Need  to describe close(2) semantics for userfaultfd │
> │file descriptor: what happens when  the  userfaultfd │
> │FD is closed?                                        │
> │                                                     │
> └─────────────────────────────────────────────────────┘
 
When userfaultfd is closed, it unregisters all memory ranges that were
previously registered with it and flushes the outstanding page fault
events.

> NAME
>        userfaultfd - create a file descriptor for handling page faults
>        in user space
> 
> SYNOPSIS
>        #include <sys/types.h>
>        #include <linux/userfaultfd.h>
> 
>        int userfaultfd(int flags);
> 
>        Note: There is no glibc  wrapper  for  this  system  call;  see
>        NOTES.
> 
> DESCRIPTION
>        userfaultfd() creates a new userfaultfd object that can be used
>        for delegation of page-fault handling to a user-space  applica‐
>        tion,  and  returns  a  file  descriptor that refers to the new
>        object.   The  new  userfaultfd  object  is  configured   using
>        ioctl(2).
> 
>        Once  the userfaultfd object is configured, the application can
>        use read(2) to receive userfaultfd  notifications.   The  reads
>        from  userfaultfd may be blocking or non-blocking, depending on
>        the value of flags used for the creation of the userfaultfd  or
>        subsequent calls to fcntl(2).
> 
>        The following values may be bitwise ORed in flags to change the
>        behavior of userfaultfd():
> 
>        O_CLOEXEC
>               Enable the close-on-exec flag for  the  new  userfaultfd
>               file  descriptor.   See the description of the O_CLOEXEC
>               flag in open(2).
> 
>        O_NONBLOCK
>               Enables  non-blocking  operation  for  the   userfaultfd
>               object.   See  the description of the O_NONBLOCK flag in
>               open(2).
> 
>    Usage
>        The userfaultfd mechanism is designed to allow a  thread  in  a
>        multithreaded  program  to  perform  user-space  paging for the
>        other threads in the process.  When a page fault occurs for one
>        of the regions registered to the userfaultfd object, the fault‐
>        ing thread is put to sleep and an event is generated  that  can
>        be  read  via  the userfaultfd file descriptor.  The fault-han‐
>        dling thread reads events from this file  descriptor  and  ser‐
>        vices  them  using  the  operations  described  in  ioctl_user‐
>        faultfd(2).  When servicing the page fault events,  the  fault-
>        handling thread can trigger a wake-up for the sleeping thread.
> 
>    Userfaultfd operation
>        After the userfaultfd object is created with userfaultfd(), the
>        application must enable it using the UFFDIO_API ioctl(2) opera‐
>        tion.  This operation allows a handshake between the kernel and
>        user space to determine the API version and supported features.
>        This  operation  must  be  performed  before  any  of the other
>        ioctl(2) operations described below (or those  operations  fail
>        with the EINVAL error).
> 
>        After  a  successful UFFDIO_API operation, the application then
>        registers  memory  address  ranges  using  the  UFFDIO_REGISTER
>        ioctl(2)  operation.   After  successful  completion  of a UFF‐
>        DIO_REGISTER operation, a page fault occurring in the requested
>        memory  range, and satisfying the mode defined at the registra‐
>        tion time, will be forwarded by the kernel  to  the  user-space
>        application.   The  application can then use the UFFDIO_COPY or
>        UFFDIO_ZEROPAGE ioctl(2) operations to resolve the page fault.
> 
>        Details of the various ioctl(2)  operations  can  be  found  in
>        ioctl_userfaultfd(2).
> 
>        Currently,  userfaultfd can be used only with anonymous private
>        memory mappings.
> 
>    Reading from the userfaultfd structure
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │are the details below correct?                       │
>        └─────────────────────────────────────────────────────┘

Yes, at least for the current upstream version. 4.11 will have quite a few
updates to userfaultfd.

>        Each read(2) from the userfaultfd file descriptor  returns  one
>        or  more  uffd_msg  structures, each of which describes a page-
>        fault event:
> 
>            struct uffd_msg {
>                __u8  event;                /* Type of event */
>                ...
>                union {
>                    struct {
>                        __u64 flags;        /* Flags describing fault */
>                        __u64 address;      /* Faulting address */
>                    } pagefault;
>                    ...
>                } arg;
> 
>                /* Padding fields omitted */
>            } __packed;
> 
>        If multiple events are available and  the  supplied  buffer  is
>        large enough, read(2) returns as many events as will fit in the
>        supplied buffer.  If the buffer supplied to read(2) is  smaller
>        than the size of the uffd_msg structure, the read(2) fails with
>        the error EINVAL.
> 
>        The fields set in the uffd_msg structure are as follows:
> 
>        event  The type of event.  Currently, only one value can appear
>               in  this  field: UFFD_EVENT_PAGEFAULT, which indicates a
>               page-fault event.
> 
>        address
>               The address that triggered the page fault.
> 
>        flags  A bit mask  of  flags  that  describe  the  event.   For
>               UFFD_EVENT_PAGEFAULT, the following flag may appear:
> 
>               UFFD_PAGEFAULT_FLAG_WRITE
>                      If  the address is in a range that was registered
>                      with the UFFDIO_REGISTER_MODE_MISSING  flag  (see
>                      ioctl_userfaultfd(2))  and this flag is set, this
>                      is a write fault; otherwise it is a read fault.
> 
>        A read(2) on a userfaultfd file descriptor can  fail  with  the
>        following errors:
> 
>        EINVAL The  userfaultfd  object  has not yet been enabled using
>               the UFFDIO_API ioctl(2) operation.
> 
>        The userfaultfd file descriptor can be monitored with  poll(2),
>        select(2),  and  epoll(7).  When events are available, the file
>        descriptor indicates as readable.
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │But, it seems,  the  object  must  be  created  with │
>        │O_NONBLOCK.  What is the rationale for this require‐ │
>        │ment? Something needs to  be  said  in  this  manual │
>        │page.                                                │
>        └─────────────────────────────────────────────────────┘

The object can be created without O_NONBLOCK, so probably the above
sentence can be rephrased as:

When the userfaultfd file descriptor is opened in non-blocking mode, it can
be monitored with ...
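
For illustration only, here is a sketch of that arrangement in C.
set_nonblocking() and wait_readable() are hypothetical helper names; only
fcntl(2) and poll(2) are real interfaces here:

```c
#include <fcntl.h>
#include <poll.h>

/* Hypothetical helper: switch an existing descriptor to non-blocking
   mode, as an alternative to passing O_NONBLOCK to userfaultfd().
   Returns 0 on success, -1 on error. */
static int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds for the
   descriptor to become readable.  Returns 1 if it is readable, 0 on
   timeout, -1 on error or if poll() reports only an error condition. */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int n = poll(&p, 1, timeout_ms);

    if (n == -1)
        return -1;
    if (n == 0)
        return 0;
    return (p.revents & POLLIN) ? 1 : -1;
}
```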

> RETURN VALUE
>        On  success,  userfaultfd()  returns a new file descriptor that
>        refers to the userfaultfd object.  On error,  -1  is  returned,
>        and errno is set appropriately.
> 
> ERRORS
>        EINVAL An unsupported value was specified in flags.
> 
>        EMFILE The  per-process  limit  on  the  number  of  open  file
>               descriptors has been reached.
> 
>        ENFILE The system-wide limit on the total number of open  files
>               has been reached.
> 
>        ENOMEM Insufficient kernel memory was available.
> 
> VERSIONS
>        The userfaultfd() system call first appeared in Linux 4.3.
> 
> CONFORMING TO
>        userfaultfd()  is Linux-specific and should not be used in pro‐
>        grams intended to be portable.
> 
> NOTES
>        Glibc does not provide a wrapper for this system call; call  it
>        using syscall(2).
> 
>        The userfaultfd mechanism can be used as an alternative to tra‐
>        ditional user-space paging techniques based on the use  of  the
>        SIGSEGV  signal  and mmap(2).  It can also be used to implement
>        lazy restore for  checkpoint/restore  mechanisms,  as  well  as
>        post-copy  migration  to allow (nearly) uninterrupted execution
>        when transferring virtual machines from one host to another.
> 
> EXAMPLE
>        The program below demonstrates the use of the userfaultfd mech‐
>        anism.   The  program creates two threads, one of which acts as
>        the page-fault handler for the process,  for  the  pages  in  a
>        demand-page zero region created using mmap(2).
> 
>        The  program takes one command-line argument, which is the num‐
>        ber of pages that will be  created  in  a  mapping  whose  page
>        faults will be handled via userfaultfd.  After creating a user‐
>        faultfd object, the program then creates an  anonymous  private
>        mapping  of  the specified size and registers the address range
>        of that mapping using the UFFDIO_REGISTER  ioctl(2)  operation.
>        The  program then creates a second thread that will perform the
>        task of handling page faults.
> 
>        The main thread then walks through the  pages  of  the  mapping
>        fetching  bytes  from successive pages.  Because the pages have
>        not yet been accessed, the first access of a byte in each  page
>        will  trigger  a  page-fault  event  on  the  userfaultfd  file
>        descriptor.
> 
>        Each of the page-fault events is handled by the second  thread,
>        which sits in a loop processing input from the userfaultfd file
>        descriptor.  In each loop iteration, the  second  thread  first
>        calls  poll(2)  to  check the state of the file descriptor, and
>        then reads an event from the file descriptor.  All such  events
>        should be UFFD_EVENT_PAGEFAULT events, which the thread handles
>        by copying a page of data into the faulting  region  using  the
>        UFFDIO_COPY ioctl(2) operation.
> 
>        The  following  is  an  example of what we see when running the
>        program:
> 
>            $ ./userfaultfd_demo 3
>            Address returned by mmap() = 0x7fd30106c000
> 
>            fault_handler_thread():
>                poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
>                UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
>                    (uffdio_copy.copy returned 4096)
>            Read address 0x7fd30106c00f in main(): A
>            Read address 0x7fd30106c40f in main(): A
>            Read address 0x7fd30106c80f in main(): A
>            Read address 0x7fd30106cc0f in main(): A
> 
>            fault_handler_thread():
>                poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
>                UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
>                    (uffdio_copy.copy returned 4096)
>            Read address 0x7fd30106d00f in main(): B
>            Read address 0x7fd30106d40f in main(): B
>            Read address 0x7fd30106d80f in main(): B
>            Read address 0x7fd30106dc0f in main(): B
> 
>            fault_handler_thread():
>                poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
>                UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
>                    (uffdio_copy.copy returned 4096)
>            Read address 0x7fd30106e00f in main(): C
>            Read address 0x7fd30106e40f in main(): C
>            Read address 0x7fd30106e80f in main(): C
>            Read address 0x7fd30106ec0f in main(): C
> 
>    Program source
> 
>        /* userfaultfd_demo.c
> 
>           Licensed under the GNU General Public License version 2 or later.
>        */
>        #define _GNU_SOURCE
>        #include <sys/types.h>
>        #include <stdio.h>
>        #include <linux/userfaultfd.h>
>        #include <pthread.h>
>        #include <errno.h>
>        #include <unistd.h>
>        #include <stdlib.h>
>        #include <fcntl.h>
>        #include <signal.h>
>        #include <poll.h>
>        #include <string.h>
>        #include <sys/mman.h>
>        #include <sys/syscall.h>
>        #include <sys/ioctl.h>
> 
>        #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
>                                } while (0)
> 
>        static int page_size;
> 
>        static void *
>        fault_handler_thread(void *arg)
>        {
>            static struct uffd_msg msg;   /* Data read from userfaultfd */
>            static int fault_cnt = 0;     /* Number of faults so far handled */
>            long uffd;                    /* userfaultfd file descriptor */
>            static char *page = NULL;
>            struct uffdio_copy uffdio_copy;
>            ssize_t nread;
> 
>            uffd = (long) arg;
> 
>            /* Create a page that will be copied into the faulting region */
> 
>            if (page == NULL) {
>                page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>                if (page == MAP_FAILED)
>                    errExit("mmap");
>            }
> 
>            /* Loop, handling incoming events on the userfaultfd
>               file descriptor */
> 
>            for (;;) {
> 
>                /* See what poll() tells us about the userfaultfd */
> 
>                struct pollfd pollfd;
>                int nready;
>                pollfd.fd = uffd;
>                pollfd.events = POLLIN;
>                nready = poll(&pollfd, 1, -1);
>                if (nready == -1)
>                    errExit("poll");
> 
>                printf("\nfault_handler_thread():\n");
>                printf("    poll() returns: nready = %d; "
>                        "POLLIN = %d; POLLERR = %d\n", nready,
>                        (pollfd.revents & POLLIN) != 0,
>                        (pollfd.revents & POLLERR) != 0);
> 
>                /* Read an event from the userfaultfd */
> 
>                nread = read(uffd, &msg, sizeof(msg));
>                if (nread == 0) {
>                    printf("EOF on userfaultfd!\n");
>                    exit(EXIT_FAILURE);
>                }
> 
>                if (nread == -1)
>                    errExit("read");
> 
>                /* We expect only one kind of event; verify that assumption */
> 
>                if (msg.event != UFFD_EVENT_PAGEFAULT) {
>                    fprintf(stderr, "Unexpected event on userfaultfd\n");
>                    exit(EXIT_FAILURE);
>                }
> 
>                /* Display info about the page-fault event */
> 
>                printf("    UFFD_EVENT_PAGEFAULT event: ");
>                printf("flags = %llx; ", msg.arg.pagefault.flags);
>                printf("address = %llx\n", msg.arg.pagefault.address);
> 
>                /* Copy the page pointed to by 'page' into the faulting
>                   region. Vary the contents that are copied in, so that it
>                   is more obvious that each fault is handled separately. */
> 
>                memset(page, 'A' + fault_cnt % 20, page_size);
>                fault_cnt++;
> 
>                uffdio_copy.src = (unsigned long) page;
> 
>                /* We need to handle page faults in units of pages(!).
>                   So, round faulting address down to page boundary */
> 
>                uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
>                                                   ~(page_size - 1);
>                uffdio_copy.len = page_size;
>                uffdio_copy.mode = 0;
>                uffdio_copy.copy = 0;
>                if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
>                    errExit("ioctl-UFFDIO_COPY");
> 
>                printf("        (uffdio_copy.copy returned %lld)\n",
>                        uffdio_copy.copy);
>            }
>        }
> 
>        int
>        main(int argc, char *argv[])
>        {
>            long uffd;          /* userfaultfd file descriptor */
>            char *addr;         /* Start of region handled by userfaultfd */
>            unsigned long len;  /* Length of region handled by userfaultfd */
>            pthread_t thr;      /* ID of thread that handles page faults */
>            struct uffdio_api uffdio_api;
>            struct uffdio_register uffdio_register;
>            int s;
> 
>            if (argc != 2) {
>                fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
>                exit(EXIT_FAILURE);
>            }
> 
>            page_size = sysconf(_SC_PAGE_SIZE);
>            len = strtoul(argv[1], NULL, 0) * page_size;
> 
>            /* Create and enable userfaultfd object */
> 
>            uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>            if (uffd == -1)
>                errExit("userfaultfd");
> 
>            uffdio_api.api = UFFD_API;
>            uffdio_api.features = 0;
>            if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
>                errExit("ioctl-UFFDIO_API");
> 
>            /* Create a private anonymous mapping. The memory will be
>               demand-zero paged--that is, not yet allocated. When we
>               actually touch the memory, it will be allocated via
>               the userfaultfd. */
> 
>            addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>            if (addr == MAP_FAILED)
>                errExit("mmap");
> 
>            printf("Address returned by mmap() = %p\n", addr);
> 
>            /* Register the memory range of the mapping we just created for
>               handling by the userfaultfd object. In mode, we request to track
>               missing pages (i.e., pages that have not yet been faulted in). */
> 
>            uffdio_register.range.start = (unsigned long) addr;
>            uffdio_register.range.len = len;
>            uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
>            if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
>                errExit("ioctl-UFFDIO_REGISTER");
> 
>            /* Create a thread that will process the userfaultfd events */
> 
>            s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
>            if (s != 0) {
>                errno = s;
>                errExit("pthread_create");
>            }
> 
>            /* Main thread now touches memory in the mapping, touching
>               locations 1024 bytes apart. This will trigger userfaultfd
>               events for all pages in the region. */
> 
>            int l;
>            l = 0xf;    /* Ensure that faulting address is not on a page
>                           boundary, in order to test that we correctly
>                           handle that case in fault_handler_thread() */
>            while (l < len) {
>                char c = addr[l];
>                printf("Read address %p in main(): ", addr + l);
>                printf("%c\n", c);
>                l += 1024;
>                usleep(100000);         /* Slow things down a little */
>            }
> 
>            exit(EXIT_SUCCESS);
>        }
> 
> SEE ALSO
>        fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)
> 
>        Documentation/vm/userfaultfd.txt in  the  Linux  kernel  source
>        tree
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/


* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-03-20 20:11 ` Review request: draft ioctl_userfaultfd(2) " Michael Kerrisk (man-pages)
@ 2017-03-22 13:54   ` Mike Rapoport
  2017-04-21  9:11     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2017-03-22 13:54 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Andrea Arcangeli; +Cc: lkml, linux-mm, linux-man

Hello Michael,

On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Andrea, Mike, and all,
> 
> Mike: here's the split out page that describes the 
> userfaultfd ioctl() operations.
> 
> I'd like to get review input, especially from you and
> Andrea, but also anyone else, for the current version
> of this page, which includes quite a few FIXMEs to be
> sorted.
> 
> I've shown the rendered version of the page below. 
> The groff source is attached, and can also be found
> at the branch here:
> 
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> 
> The new ioctl_userfaultfd(2) page follows this mail.
> 
> Cheers,
> 
> Michael
> 
> NAME
>        ioctl_userfaultfd - create a file descriptor for handling page faults in
>        user space
> 
> SYNOPSIS
>        #include <sys/ioctl.h>
> 
>        int ioctl(int fd, int cmd, ...);
> 
> DESCRIPTION
>        Various ioctl(2) operations can be performed on  a  userfaultfd  object
>        (created by a call to userfaultfd(2)) using calls of the form:
> 
>            ioctl(fd, cmd, argp);
> 
>        In  the  above,  fd  is  a  file  descriptor referring to a userfaultfd
>        object, cmd is one of the commands listed below, and argp is a  pointer
>        to a data structure that is specific to cmd.
> 
>        The  various  ioctl(2) operations are described below.  The UFFDIO_API,
>        UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
>        userfaultfd behavior.  These operations allow the caller to choose what
>        features will be enabled and what kinds of events will be delivered  to
>        the application.  The remaining operations are range operations.  These
>        operations enable the calling application to resolve page-fault  events
>        in a consistent way.
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: What does "consistent" mean?                  │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

Andrea, can you please help with this one?
 
>    UFFDIO_API
>        (Since Linux 4.3.)  Enable operation of the userfaultfd and perform API
>        handshake.  The argp argument is a pointer to a  uffdio_api  structure,
>        defined as:
> 
>            struct uffdio_api {
>                __u64 api;        /* Requested API version (input) */
>                __u64 features;   /* Must be zero */
>                __u64 ioctls;     /* Available ioctl() operations (output) */
>            };
> 
>        The  api  field  denotes  the API version requested by the application.
>        Before the call, the features field must be initialized to zero.
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above: Why must the 'features' field be  initialized │
>        │to zero?                                             │
>        └─────────────────────────────────────────────────────┘

Until 4.11 the only supported feature is delegation of missing page faults
and the UFFD_API_FEATURES bitmask is 0.
There's a check in the uffdio_api call that the user is not trying to enable
any other functionality: it asserts that uffdio_api.features is zero [1].
Starting from 4.11 the feature negotiation is different. Now the uffdio_api
call verifies that it can support the features the application requested [2].
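
For kernels before 4.11, the handshake from the application side could look
roughly like the sketch below. uffd_handshake() and has_ioctl() are
hypothetical helper names; UFFD_API, UFFDIO_API and the _UFFDIO_* bit numbers
come from <linux/userfaultfd.h>:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: perform the UFFDIO_API handshake with no
   features requested and hand back the 'ioctls' bit mask filled in by
   the kernel.  Returns 0 on success, -1 on error. */
static int
uffd_handshake(int uffd, uint64_t *ioctls)
{
    assert(ioctls != NULL);

    struct uffdio_api api = { .api = UFFD_API, .features = 0 };

    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    *ioctls = api.ioctls;
    return 0;
}

/* Hypothetical helper: test one bit of the returned mask, for example
   has_ioctl(mask, _UFFDIO_REGISTER). */
static int
has_ioctl(uint64_t ioctls, unsigned int nr)
{
    assert(nr < 64);
    return (ioctls & ((uint64_t) 1 << nr)) != 0;
}
```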


>        The  kernel verifies that it can support the requested API version, and
>        sets the features and ioctls fields to bit masks representing  all  the
>        available features and the generic ioctl(2) operations available.  Cur‐
>        rently, zero (i.e., no feature bits) is placed in the  features  field.
>        The returned ioctls field can contain the following bits:
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │This  user-space  API  seems not fully polished. Why │
>        │are there not constants defined for each of the bit- │
>        │mask values listed below?                            │
>        └─────────────────────────────────────────────────────┘
> 
>        1 << _UFFDIO_API
>               The UFFDIO_API operation is supported.
> 
>        1 << _UFFDIO_REGISTER
>               The UFFDIO_REGISTER operation is supported.
> 
>        1 << _UFFDIO_UNREGISTER
>               The UFFDIO_UNREGISTER operation is supported.

Well, I tend to agree. I believe the original intention was to use the
OR'ed mask, like UFFD_API_IOCTLS.
Andrea, can you add something?
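
As a rough sketch of how an application might test the returned ioctls
mask, either bit by bit or against the OR'ed UFFD_API_IOCTLS mask, consider
the following. The two helper names are hypothetical; the _UFFDIO_*
numbers and UFFD_API_IOCTLS come from <linux/userfaultfd.h>. Note that the
shift is done on a 64-bit value, since _UFFDIO_API is too large for a
plain int shift.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: is the operation with number 'nr' set in 'ioctls'? */
bool uffd_ioctl_supported(__u64 ioctls, unsigned int nr)
{
    assert(nr < 64);    /* the mask is 64 bits wide */
    return (ioctls & ((__u64)1 << nr)) != 0;
}

/* Hypothetical helper: are all of the basic operations available? */
bool uffd_has_basic_ioctls(__u64 ioctls)
{
    return (ioctls & UFFD_API_IOCTLS) == UFFD_API_IOCTLS;
}
```

The first helper mirrors the "1 << _UFFDIO_xxx" notation used in the draft
page; the second is closer to the OR'ed-mask usage mentioned above.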

> 
> 
>               ┌─────────────────────────────────────────────────────┐
>               │FIXME                                                │
>               ├─────────────────────────────────────────────────────┤
>               │Is  the above description of the 'ioctls' field cor‐ │
>               │rect?  Does more need to be said?                    │
>               │                                                     │
>               └─────────────────────────────────────────────────────┘

This is correct. I wouldn't add anything else.
 
>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>        and  errno  is set to indicate the cause of the error.  Possible errors
>        include:
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Is the following error list correct?                 │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

There's also -EFAULT in case copy_{from,to}_user fails.

> 
>        EINVAL The userfaultfd has already been  enabled  by  a  previous  UFF‐
>               DIO_API operation.
> 
>        EINVAL The  API  version requested in the api field is not supported by
>               this kernel, or the features field was not zero.
>
>               ┌─────────────────────────────────────────────────────┐
>               │FIXME                                                │
>               ├─────────────────────────────────────────────────────┤
>               │In the above error case, the  returned  'uffdio_api' │
>               │structure is zeroed out. Why is this done? This      │
>               │should be explained in the manual page.              │
>               │                                                     │
>               └─────────────────────────────────────────────────────┘
 
In my understanding the uffdio_api structure is zeroed to allow the caller
to distinguish the reasons for -EINVAL.

>    UFFDIO_REGISTER
>        (Since Linux 4.3.)  Register a memory  address  range  with  the  user‐
>        faultfd  object.   The  argp argument is a pointer to a uffdio_register
>        structure, defined as:
> 
>            struct uffdio_range {
>                __u64 start;    /* Start of range */
>                __u64 len;      /* Length of range (bytes) */
>            };
> 
>            struct uffdio_register {
>                struct uffdio_range range;
>                __u64 mode;     /* Desired mode of operation (input) */
>                __u64 ioctls;   /* Available ioctl() operations (output) */
>            };
> 
> 
>        The range field defines a memory range starting at start and continuing
>        for len bytes that should be handled by the userfaultfd.
> 
>        The  mode  field  defines the mode of operation desired for this memory
>        region.  The following values may be bitwise  ORed  to  set  the  user‐
>        faultfd mode for the specified range:
> 
>        UFFDIO_REGISTER_MODE_MISSING
>               Track page faults on missing pages.
> 
>        UFFDIO_REGISTER_MODE_WP
>               Track page faults on write-protected pages.
> 
>        Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
> 
>        If the operation is successful, the kernel modifies the ioctls bit-mask
>        field to indicate which ioctl(2) operations are available for the spec‐
>        ified range.  This returned bit mask is as for UFFDIO_API.
> 
>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>        and errno is set to indicate the cause of the error.   Possible  errors
>        include:
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Is the following error list correct?                 │
>        │                                                     │
>        └─────────────────────────────────────────────────────┘

Here again it may be -EFAULT to indicate copy_{from,to}_user failure.
And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
mm_struct has gone by the time userfault grabs it. 
 
>        EBUSY  A  mapping  in  the  specified  range is registered with another
>               userfaultfd object.
> 
>        EINVAL An invalid or unsupported bit was specified in the  mode  field;
>               or the mode field was zero.
> 
>        EINVAL There is no mapping in the specified address range.
> 
>        EINVAL range.start  or  range.len  is not a multiple of the system page
>               size; or, range.len is  zero;  or  these  fields  are  otherwise
>               invalid.
> 
>        EINVAL There was an incompatible mapping in the specified address range.
> 
> 
>               ┌─────────────────────────────────────────────────────┐
>               │FIXME                                                │
>               ├─────────────────────────────────────────────────────┤
>               │Above: What does "incompatible" mean?                │
>               │                                                     │
>               └─────────────────────────────────────────────────────┘

Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
MAP_PRIVATE mappings.

>    UFFDIO_UNREGISTER
>        (Since Linux 4.3.)  Unregister a memory address range from userfaultfd.
>        The address range to unregister is specified in the uffdio_range struc‐
>        ture pointed to by argp.
> 
>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>        and errno is set to indicate the cause of the error.   Possible  errors
>        include:
> 
>        EINVAL Either the  start or the len field of the uffdio_range structure
>               was not a multiple of the system page size; or the len field was
>               zero; or these fields were otherwise invalid.
> 
>        EINVAL There was an incompatible mapping in the specified address range.
> 
> 
>               ┌─────────────────────────────────────────────────────┐
>               │FIXME                                                │
>               ├─────────────────────────────────────────────────────┤
>               │Above: What does "incompatible" mean?                │
>               └─────────────────────────────────────────────────────┘

The same comments as for UFFDIO_REGISTER apply here as well.

>        EINVAL There was no mapping in the specified address range.
> 
>    UFFDIO_COPY
>        (Since  Linux 4.3.)  Atomically copy a continuous memory chunk into the
>        userfault registered range and optionally wake up the  blocked  thread.
>        The  source  and  destination addresses and the number of bytes to copy
>        are specified by the src, dst, and len fields of the uffdio_copy struc‐
>        ture pointed to by argp:
> 
>            struct uffdio_copy {
>                __u64 dst;    /* Destination of copy */
>                __u64 src;    /* Source of copy */
>                __u64 len;    /* Number of bytes to copy */
>                __u64 mode;   /* Flags controlling behavior of copy */
>                __s64 copy;   /* Number of bytes copied, or negated error */
>            };
> 
>        The  following value may be bitwise ORed in mode to change the behavior
>        of the UFFDIO_COPY operation:
> 
>        UFFDIO_COPY_MODE_DONTWAKE
>               Do not wake up the thread that waits for page-fault resolution.
> 
>        The copy field is used by the kernel to return the number of bytes that
>        was actually copied, or an error (a negated errno-style value).
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Above:  Why is the 'copy' field used to return error │
>        │values?  This should  be  explained  in  the  manual │
>        │page.                                                │
>        └─────────────────────────────────────────────────────┘

Andrea, can you help with this one, please?

>        If  the  value returned in copy doesn't match the value that was speci‐
>        fied in len, the operation fails with the error EAGAIN.  The copy field
>        is output-only; it is not read by the UFFDIO_COPY operation.
> 
>        This ioctl(2) operation returns 0 on success.  In this case, the entire
>        area was copied.  On error, -1 is returned and errno is set to indicate
>        the cause of the error.  Possible errors include:
> 
>        EAGAIN The number of bytes copied (i.e., the value returned in the copy
>               field) does not equal the value that was specified  in  the  len
>               field.
> 
>        EINVAL Either dst or len was not a multiple of the system page size, or
>               the range specified by src and len or dst and len was invalid.
> 
>        EINVAL An invalid bit was specified in the mode field.
> 
>    UFFDIO_ZEROPAGE
>        (Since Linux 4.3.)  Zero out  a  memory  range  registered  with  user‐
>        faultfd.   The  requested  range is specified by the range field of the
>        uffdio_zeropage structure pointed to by argp:
> 
>            struct uffdio_zeropage {
>                struct uffdio_range range;
>                __u64 mode;     /* Flags controlling behavior of copy */
>                __s64 zeropage; /* Number of bytes zeroed, or negated error */
>            };
> 
>        The following value may be bitwise ORed in mode to change the  behavior
>        of the UFFDIO_ZEROPAGE operation:
> 
>        UFFDIO_ZEROPAGE_MODE_DONTWAKE
>               Do not wake up the thread that waits for page-fault resolution.
> 
>        The  zeropage field is used by the kernel to return the number of bytes
>        that was actually zeroed, or an  error  in  the  same  manner  as  UFF‐
>        DIO_COPY.
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Why  is  the  'zeropage'  field used to return error │
>        │values?  This should  be  explained  in  the  manual │
>        │page.                                                │
>        └─────────────────────────────────────────────────────┘
>        If  the  value  returned  in the zeropage field doesn't match the value
>        that was specified in range.len, the operation  fails  with  the  error
>        EAGAIN.  The zeropage field is output-only; it is not read by the
>        UFFDIO_ZEROPAGE operation.
> 
>        This ioctl(2) operation returns 0 on success.  In this case, the entire
>        area was zeroed.  On error, -1 is returned and errno is set to indicate
>        the cause of the error.  Possible errors include:
> 
>        EAGAIN The number of bytes zeroed (i.e.,  the  value  returned  in  the
>               zeropage  field)  does not equal the value that was specified in
>               the range.len field.
> 
>        EINVAL Either range.start or range.len was not a multiple of the system
>               page  size;  or  range.len  was zero; or the range specified was
>               invalid.
> 
>        EINVAL An invalid bit was specified in the mode field.
> 
>    UFFDIO_WAKE
>        (Since Linux 4.3.)  Wake up the thread waiting for  page-fault  resolu‐
>        tion  on  a  specified  memory  address  range.  The argp argument is a
>        pointer to a uffdio_range structure (shown above)  that  specifies  the
>        address range.
> 
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │Need more detail here. How is the UFFDIO_WAKE opera‐ │
>        │tion used?                                           │
>        └─────────────────────────────────────────────────────┘

The UFFDIO_WAKE operation is used in conjunction with
UFFDIO_{COPY,ZEROPAGE} operations that have the
UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.
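
The batching pattern described above could look roughly like this. It is
only a sketch under stated assumptions: the helper name uffd_copy_batch()
is hypothetical, one page is copied per call purely for illustration, and
a real monitor would also inspect the copy field when UFFDIO_COPY fails
with EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: copy 'npages' pages from 'src' to the registered
 * range starting at 'dst' without waking the faulting thread, then wake
 * it once for the whole range. Returns 0 on success, -1 with errno set.
 */
int uffd_copy_batch(int uffd, __u64 dst, const void *src,
                    size_t npages, size_t page_size)
{
    struct uffdio_copy copy;
    struct uffdio_range range;
    size_t i;

    assert(page_size > 0);

    for (i = 0; i < npages; i++) {
        memset(&copy, 0, sizeof(copy));
        copy.dst = dst + i * page_size;
        copy.src = (__u64)(uintptr_t)src + i * page_size;
        copy.len = page_size;
        copy.mode = UFFDIO_COPY_MODE_DONTWAKE;  /* defer the wake-up */

        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
            return -1;
    }

    /* One explicit wake-up covering everything that was just filled. */
    range.start = dst;
    range.len = npages * page_size;
    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```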

>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>        and  errno  is set to indicate the cause of the error.  Possible errors
>        include:
> 
>        EINVAL The start or the len field of the uffdio_range structure was not
>               a  multiple  of  the  system  page size; or len was zero; or the
>               specified range was otherwise invalid.
> 
> RETURN VALUE
>        See descriptions of the individual operations, above.
> 
> ERRORS
>        See descriptions of the individual operations, above.  In addition, the
>        following  general errors can occur for all of the operations described
>        above:
> 
>        EFAULT argp does not point to a valid memory address.
> 
>        EINVAL (For all operations except UFFDIO_API.)  The userfaultfd  object
>               has not yet been enabled (via the UFFDIO_API operation).
> 
> CONFORMING TO
>        These ioctl(2) operations are Linux-specific.
> 
> EXAMPLE
>        See userfaultfd(2).
> 
> SEE ALSO
>        ioctl(2), mmap(2), userfaultfd(2)
> 
>        Documentation/vm/userfaultfd.txt in the Linux kernel source tree
> 

[1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680


* Re: Review request: draft userfaultfd(2) manual page
  2017-03-21 14:01 ` Review request: draft userfaultfd(2) " Mike Rapoport
@ 2017-04-21  6:30   ` Michael Kerrisk (man-pages)
  2017-04-21 11:06     ` Mike Rapoport
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-04-21  6:30 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: mtk.manpages, Andrea Arcangeli, lkml, linux-mm, linux-man

Hello Mike,

On 03/21/2017 03:01 PM, Mike Rapoport wrote:
> Hello Michael,
> 
> On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hello Andrea, Mike, and all,
>>
>> Mike: thanks for the page that you sent. I've reworked it
>> a bit, and also added a lot of further information,
>> and an example program. In the process, I split the page
>> into two pieces, with one piece describing the userfaultfd()
>> system call and the other describing the ioctl() operations.
>>
>> I'd like to get review input, especially from you and
>> Andrea, but also anyone else, for the current version
>> of this page, which includes a few FIXMEs to be sorted.
> 
> Thanks for the update. I'm addressing the FIXME points you've mentioned
> below.

Thanks!

> Otherwise, everything seems the right description of the current upstream.
> 4.11 will have quite a few updates to userfault and we'll need to update
> this page and ioctl_userfaultfd(2) to address those updates. I am planning
> to work on the man update in the next few weeks. 
>  
>> I've shown the rendered version of the page below. 
>> The groff source is attached, and can also be found
>> at the branch here:
>  
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>
>> The new ioctl_userfaultfd(2) page follows this mail.
>>
>> Cheers,
>>
>> Michael
>  
> --
> Sincerely yours,
> Mike. 
>  
> 
>> USERFAULTFD(2)         Linux Programmer's Manual        USERFAULTFD(2)
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME                                                │
>> ├─────────────────────────────────────────────────────┤
>> │Need to  describe close(2) semantics for userfaultfd │
>> │file descriptor: what happens when  the  userfaultfd │
>> │FD is closed?                                        │
>> │                                                     │
>> └─────────────────────────────────────────────────────┘
>  
> When userfaultfd is closed, it unregisters all memory ranges that were
> previously registered with it and flushes the outstanding page fault
> events.

Presumably, this is more precisely stated as, "when the last
file descriptor referring to a userfaultfd object is closed..."?

I've made the text:

       When the last file descriptor referring to a userfaultfd object
       is  closed,  all  memory  ranges  that were registered with the
       object  are  unregistered  and  unread  page-fault  events  are
       flushed.

[...]

>>    Reading from the userfaultfd structure
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │are the details below correct?                       │
>>        └─────────────────────────────────────────────────────┘
> 
> Yes, at least for the current upstream version. 4.11 will have quite a few
> updates to userfaultfd.

Okay.

>>        Each read(2) from the userfaultfd file descriptor  returns  one
>>        or  more  uffd_msg  structures, each of which describes a page-
>>        fault event:
>>
>>            struct uffd_msg {
>>                __u8  event;                /* Type of event */
>>                ...
>>                union {
>>                    struct {
>>                        __u64 flags;        /* Flags describing fault */
>>                        __u64 address;      /* Faulting address */
>>                    } pagefault;
>>                    ...
>>                } arg;
>>
>>                /* Padding fields omitted */
>>            } __packed;
>>
>>        If multiple events are available and  the  supplied  buffer  is
>>        large enough, read(2) returns as many events as will fit in the
>>        supplied buffer.  If the buffer supplied to read(2) is  smaller
>>        than the size of the uffd_msg structure, the read(2) fails with
>>        the error EINVAL.
>>
>>        The fields set in the uffd_msg structure are as follows:
>>
>>        event  The type of event.  Currently, only one value can appear
>>               in  this  field: UFFD_EVENT_PAGEFAULT, which indicates a
>>               page-fault event.
>>
>>        address
>>               The address that triggered the page fault.
>>
>>        flags  A bit mask  of  flags  that  describe  the  event.   For
>>               UFFD_EVENT_PAGEFAULT, the following flag may appear:
>>
>>               UFFD_PAGEFAULT_FLAG_WRITE
>>                      If  the address is in a range that was registered
>>                      with the UFFDIO_REGISTER_MODE_MISSING  flag  (see
>>                      ioctl_userfaultfd(2))  and this flag is set, this
>>                      is a write fault; otherwise it is a read fault.
>>
>>        A read(2) on a userfaultfd file descriptor can  fail  with  the
>>        following errors:
>>
>>        EINVAL The  userfaultfd  object  has not yet been enabled using
>>               the UFFDIO_API ioctl(2) operation.
>>
>>        The userfaultfd file descriptor can be monitored with  poll(2),
>>        select(2),  and  epoll(7).  When events are available, the file
>>        descriptor indicates as readable.
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │But, it seems,  the  object  must  be  created  with │
>>        │O_NONBLOCK.  What is the rationale for this require‐ │
>>        │ment? Something needs to  be  said  in  this  manual │
>>        │page.                                                │
>>        └─────────────────────────────────────────────────────┘
> 
> The object can be created without O_NONBLOCK, so probably the above
> sentence can be rephrased as:
> 
> When the userfaultfd file descriptor is opened in non-blocking mode, it can
> be monitored with ...

Yes, but why is there this requirement for poll() etc. with the
O_NONBLOCK flag? I think something about that needs to be said in the 
man page. Sorry, my FIXME was not clear enough. I've reworded the text 
and the FIXME:

       If the O_NONBLOCK flag is enabled in the associated  open  file
       description,  the  userfaultfd file descriptor can be monitored
       with poll(2), select(2), and epoll(7).  When events are  avail‐
       able, the file descriptor indicates as readable.  If the O_NON‐
       BLOCK flag is not enabled, then poll(2) (always) indicates  the
       file as having a POLLERR condition, and select(2) indicates the
       file descriptor as both readable and writable.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │What is the reason for this seemingly  odd  behavior │
       │with  respect  to  the  O_NONBLOCK  flag? (see user‐ │
       │faultfd_poll()  in   fs/userfaultfd.c).    Something │
       │needs to be said about this.                         │
       └─────────────────────────────────────────────────────┘
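
As a possible illustration of the poll(2) usage described above, the sketch
below waits for a single event on a descriptor that is assumed to have
O_NONBLOCK set. The helper name uffd_wait_event() and the use of EIO for
unexpected conditions are choices made for this sketch only; they are not
part of the userfaultfd API.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: wait up to 'timeout_ms' for one event.
 * Returns 1 with *msg filled in, 0 if no event arrived, -1 on error.
 */
int uffd_wait_event(int uffd, struct uffd_msg *msg, int timeout_ms)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    ssize_t n;
    int ready;

    assert(msg != NULL);

    ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;               /* 0: timed out, -1: poll() failed */

    if (!(pfd.revents & POLLIN)) {
        /* For example POLLERR, reported when O_NONBLOCK is not enabled. */
        errno = EIO;                /* arbitrary choice for this sketch */
        return -1;
    }

    n = read(uffd, msg, sizeof(*msg));
    if (n == -1)
        return errno == EAGAIN ? 0 : -1;
    if (n != (ssize_t)sizeof(*msg)) {
        errno = EIO;                /* short read: not expected here */
        return -1;
    }
    return 1;                       /* msg->event describes the event */
}
```

On a return value of 1, the caller would typically check that msg->event is
UFFD_EVENT_PAGEFAULT before using msg->arg.pagefault.address.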

[...]

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-03-22 13:54   ` Mike Rapoport
@ 2017-04-21  9:11     ` Michael Kerrisk (man-pages)
  2017-04-21 11:07       ` Mike Rapoport
  2017-05-03 21:46       ` Andrea Arcangeli
  0 siblings, 2 replies; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-04-21  9:11 UTC (permalink / raw)
  To: Mike Rapoport, Andrea Arcangeli; +Cc: mtk.manpages, lkml, linux-mm, linux-man

Hello Mike,
Hello Andrea (we need your help!),

On 03/22/2017 02:54 PM, Mike Rapoport wrote:
> Hello Michael,
> 
> On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hello Andrea, Mike, and all,
>>
>> Mike: here's the split out page that describes the 
>> userfaultfd ioctl() operations.
>>
>> I'd like to get review input, especially from you and
>> Andrea, but also anyone else, for the current version
>> of this page, which includes quite a few FIXMEs to be
>> sorted.
>>
>> I've shown the rendered version of the page below. 
>> The groff source is attached, and can also be found
>> at the branch here:
>>
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>
>> The new ioctl_userfaultfd(2) page follows this mail.
>>
>> Cheers,
>>
>> Michael
>>
>> NAME
>>        userfaultfd - create a file descriptor for handling page faults in user
>>        space
>>
>> SYNOPSIS
>>        #include <sys/ioctl.h>
>>
>>        int ioctl(int fd, int cmd, ...);
>>
>> DESCRIPTION
>>        Various ioctl(2) operations can be performed on  a  userfaultfd  object
>>        (created by a call to userfaultfd(2)) using calls of the form:
>>
>>            ioctl(fd, cmd, argp);
>>
>>        In  the  above,  fd  is  a  file  descriptor referring to a userfaultfd
>>        object, cmd is one of the commands listed below, and argp is a  pointer
>>        to a data structure that is specific to cmd.
>>
>>        The  various  ioctl(2) operations are described below.  The UFFDIO_API,
>>        UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
>>        userfaultfd behavior.  These operations allow the caller to choose what
>>        features will be enabled and what kinds of events will be delivered  to
>>        the application.  The remaining operations are range operations.  These
>>        operations enable the calling application to resolve page-fault  events
>>        in a consistent way.
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Above: What does "consistent" mean?                  │
>>        │                                                     │
>>        └─────────────────────────────────────────────────────┘
> 
> Andrea, can you please help with this one?

Let's see what Andrea has to say.

>>    UFFDIO_API
>>        (Since Linux 4.3.)  Enable operation of the userfaultfd and perform API
>>        handshake.  The argp argument is a pointer to a  uffdio_api  structure,
>>        defined as:
>>
>>            struct uffdio_api {
>>                __u64 api;        /* Requested API version (input) */
>>                __u64 features;   /* Must be zero */
>>                __u64 ioctls;     /* Available ioctl() operations (output) */
>>            };
>>
>>        The  api  field  denotes  the API version requested by the application.
>>        Before the call, the features field must be initialized to zero.
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Above: Why must the 'features' field be  initialized │
>>        │to zero?                                             │
>>        └─────────────────────────────────────────────────────┘
> 
> Until 4.11 the only supported feature is delegation of missing page fault
> and the UFFDIO_FEATURES bitmask is 0.

So, the thing that was not clear, but now I think I understand:
'features' is an input field where one can ask about supported features
(but none are supported, before Linux 4.11). Is that correct? I've changed
the text here to read:

       Before the call, the features field must be  initialized
       to  zero.  In the future, it is intended that this field can be
       used to ask whether particular features are supported.

Seem okay?

> There's a check in uffdio_api call that the user is not trying to enable
> any other functionality and it asserts that uffdio_api.features is zero [1].
> Starting from 4.11 the features negotiation is different. Now the uffdio_api call
> verifies that it can support features the application requested [2].

Okay.

>>        The  kernel verifies that it can support the requested API version, and
>>        sets the features and ioctls fields to bit masks representing  all  the
>>        available features and the generic ioctl(2) operations available.  Cur‐
>>        rently, zero (i.e., no feature bits) is placed in the  features  field.
>>        The returned ioctls field can contain the following bits:
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │This  user-space  API  seems not fully polished. Why │
>>        │are there not constants defined for each of the bit- │
>>        │mask values listed below?                            │
>>        └─────────────────────────────────────────────────────┘
>>
>>        1 << _UFFDIO_API
>>               The UFFDIO_API operation is supported.
>>
>>        1 << _UFFDIO_REGISTER
>>               The UFFDIO_REGISTER operation is supported.
>>
>>        1 << _UFFDIO_UNREGISTER
>>               The UFFDIO_UNREGISTER operation is supported.
> 
> Well, I tend to agree. I believe the original intention was to use the
> OR'ed mask, like UFFD_API_IOCTLS.
> Andrea, can you add something?

Yes, Andrea, please!

>>
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │Is  the above description of the 'ioctls' field cor‐ │
>>               │rect?  Does more need to be said?                    │
>>               │                                                     │
>>               └─────────────────────────────────────────────────────┘
> 
> This is correct. I wouldn't add anything else.

Thanks.

>>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>>        and  errno  is set to indicate the cause of the error.  Possible errors
>>        include:
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Is the following error list correct?                 │
>>        │                                                     │
>>        └─────────────────────────────────────────────────────┘
> 
> There's also -EFAULT in case copy_{from,to}_user fails.

Okay -- I have added that error.

>>
>>        EINVAL The userfaultfd has already been  enabled  by  a  previous  UFF‐
>>               DIO_API operation.
>>
>>        EINVAL The  API  version requested in the api field is not supported by
>>               this kernel, or the features field was not zero.
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │In the above error case, the  returned  'uffdio_api' │
>>               │structure is zeroed out. Why is this done? This      │
>>               │should be explained in the manual page.              │
>>               │                                                     │
>>               └─────────────────────────────────────────────────────┘
>  
> In my understanding the uffdio_api structure is zeroed to allow the caller
> to distinguish the reasons for -EINVAL.

Andrea, can you please help here?


>>    UFFDIO_REGISTER
>>        (Since Linux 4.3.)  Register a memory  address  range  with  the  user‐
>>        faultfd  object.   The  argp argument is a pointer to a uffdio_register
>>        structure, defined as:
>>
>>            struct uffdio_range {
>>                __u64 start;    /* Start of range */
>>                __u64 len;      /* Length of range (bytes) */
>>            };
>>
>>            struct uffdio_register {
>>                struct uffdio_range range;
>>                __u64 mode;     /* Desired mode of operation (input) */
>>                __u64 ioctls;   /* Available ioctl() operations (output) */
>>            };
>>
>>
>>        The range field defines a memory range starting at start and continuing
>>        for len bytes that should be handled by the userfaultfd.
>>
>>        The  mode  field  defines the mode of operation desired for this memory
>>        region.  The following values may be bitwise  ORed  to  set  the  user‐
>>        faultfd mode for the specified range:
>>
>>        UFFDIO_REGISTER_MODE_MISSING
>>               Track page faults on missing pages.
>>
>>        UFFDIO_REGISTER_MODE_WP
>>               Track page faults on write-protected pages.
>>
>>        Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
>>
>>        If the operation is successful, the kernel modifies the ioctls bit-mask
>>        field to indicate which ioctl(2) operations are available for the spec‐
>>        ified range.  This returned bit mask is as for UFFDIO_API.
>>
>>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>>        and errno is set to indicate the cause of the error.   Possible  errors
>>        include:
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Is the following error list correct?                 │
>>        │                                                     │
>>        └─────────────────────────────────────────────────────┘
> 
> Here again it may be -EFAULT to indicate copy_{from,to}_user failure.
> And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
> mm_struct has gone by the time userfault grabs it. 

Okay -- added EFAULT. I think I'll skip ENOMEM for the moment, but
will note the possibility in the page source.

>>        EBUSY  A  mapping  in  the  specified  range is registered with another
>>               userfaultfd object.
>>
>>        EINVAL An invalid or unsupported bit was specified in the  mode  field;
>>               or the mode field was zero.
>>
>>        EINVAL There is no mapping in the specified address range.
>>
>>        EINVAL range.start  or  range.len  is not a multiple of the system page
>>               size; or, range.len is  zero;  or  these  fields  are  otherwise
>>               invalid.
>>
>>        EINVAL There is an incompatible mapping in the specified address range.
>>
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │Above: What does "incompatible" mean?                │
>>               │                                                     │
>>               └─────────────────────────────────────────────────────┘
> 
> Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
> MAP_PRIVATE mappings.

Hmmm -- this restriction is not actually mentioned in the description
of UFFDIO_REGISTER. So, at the start of the description of that operation, 
I've made the text as follows:

[[
.SS UFFDIO_REGISTER
(Since Linux 4.3.)
Register a memory address range with the userfaultfd object.
The pages in the range must be "compatible".
In the current implementation,
.\" According to Mike Rapoport, this will change in Linux 4.11.
only private anonymous ranges are compatible for registering with
.BR UFFDIO_REGISTER .
]]

Okay?
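
For illustration, here is a minimal sketch of a UFFDIO_REGISTER call as the draft describes it. The helper name uffd_register_missing() is hypothetical, and the sketch assumes that uffd is a userfaultfd file descriptor that has already been enabled with UFFDIO_API and that the range lies inside a private anonymous mapping:

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper (not part of any library): register the range
   [addr, addr + len) for missing-page faults. On success, returns 0 and,
   if ioctls_out is non-NULL, stores the bit mask of ioctl() operations
   that the kernel reports for the range. On error, returns -1 with
   errno set by ioctl(2). */
static int uffd_register_missing(int uffd, void *addr, size_t len,
                                 __u64 *ioctls_out)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (__u64)(uintptr_t)addr;   /* Must be page-aligned */
    reg.range.len = len;                        /* Nonzero page multiple */
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;    /* Only mode supported now */

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -1;

    if (ioctls_out != NULL)
        *ioctls_out = reg.ioctls;               /* Output field */
    return 0;
}
```

A caller would typically pass the address and length of a region obtained from mmap(2) with MAP_PRIVATE | MAP_ANONYMOUS.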

>>    UFFDIO_UNREGISTER
>>        (Since Linux 4.3.)  Unregister a memory address range from userfaultfd.
>>        The address range to unregister is specified in the uffdio_range struc‐
>>        ture pointed to by argp.
>>
>>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>>        and errno is set to indicate the cause of the error.   Possible  errors
>>        include:
>>
>>        EINVAL Either  the start or the len field of the uffdio_range structure
>>               was not a multiple of the system page size; or the len field was
>>               zero; or these fields were otherwise invalid.
>>
>>        EINVAL There was an incompatible mapping in the specified address range.
>>
>>
>>               ┌─────────────────────────────────────────────────────┐
>>               │FIXME                                                │
>>               ├─────────────────────────────────────────────────────┤
>>               │Above: What does "incompatible" mean?                │
>>               └─────────────────────────────────────────────────────┘
> 
> The same comments as for UFFDIO_REGISTER apply here as well.

Okay. I changed the introductory text on UFFDIO_UNREGISTER to say:

[[
.SS UFFDIO_UNREGISTER
(Since Linux 4.3.)
Unregister a memory address range from userfaultfd.
The pages in the range must be "compatible" (see the description of
.BR  UFFDIO_REGISTER .)
]]

Okay?
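
The alignment conditions behind the EINVAL cases for UFFDIO_REGISTER and UFFDIO_UNREGISTER can be pre-checked in user space. The following is only a sketch: the function name is hypothetical, the overflow test is an assumption about what "otherwise invalid" might include, and the kernel remains the final authority on what it accepts:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pre-check mirroring the documented EINVAL conditions:
   start and len must be multiples of the system page size, and len must
   be nonzero. page_size must be a power of two, for example the value
   returned by sysconf(_SC_PAGESIZE). */
static bool uffd_range_looks_valid(uint64_t start, uint64_t len,
                                   uint64_t page_size)
{
    uint64_t mask = page_size - 1;

    if (len == 0)
        return false;
    if ((start & mask) != 0 || (len & mask) != 0)
        return false;
    /* Assumption: a range whose end wraps around the address space is
       one of the "otherwise invalid" cases. */
    if (start + len < start)
        return false;
    return true;
}
```

Such a check cannot detect the "incompatible mapping" or "no mapping" cases, which depend on the state of the address space.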

>>        EINVAL There was no mapping in the specified address range.
>>
>>    UFFDIO_COPY
>>        (Since  Linux 4.3.)  Atomically copy a continuous memory chunk into the
>>        userfault registered range and optionally wake up the  blocked  thread.
>>        The  source  and  destination addresses and the number of bytes to copy
>>        are specified by the src, dst, and len fields of the uffdio_copy struc‐
>>        ture pointed to by argp:
>>
>>            struct uffdio_copy {
>>                __u64 dst;    /* Destination of copy */
>>                __u64 src;    /* Source of copy */
>>                __u64 len;    /* Number of bytes to copy */
>>                __u64 mode;   /* Flags controlling behavior of copy */
>>                __s64 copy;   /* Number of bytes copied, or negated error */
>>            };
>>
>>        The  following value may be bitwise ORed in mode to change the behavior
>>        of the UFFDIO_COPY operation:
>>
>>        UFFDIO_COPY_MODE_DONTWAKE
>>               Do not wake up the thread that waits for page-fault resolution.
>>
>>        The copy field is used by the kernel to return the number of bytes that
>>        was actually copied, or an error (a negated errno-style value).
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Above:  Why is the 'copy' field used to return error │
>>        │values?  This should  be  explained  in  the  manual │
>>        │page.                                                │
>>        └─────────────────────────────────────────────────────┘
> 
> Andrea, can you help with this one, please?

Yes, Andrea, please.
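
Whatever the rationale turns out to be, the draft text for UFFDIO_COPY (the copy field reports progress, and EAGAIN signals a short copy) suggests a retry loop along the following lines. This is a sketch under that reading of the draft, not a statement of intended usage; the helper name uffd_copy_all() is hypothetical:

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: copy len bytes from src to dst with UFFDIO_COPY,
   retrying the remainder when the kernel reports a partial copy.
   Returns 0 once everything is copied, or -1 with errno set. */
static int uffd_copy_all(int uffd, __u64 dst, __u64 src, __u64 len)
{
    while (len > 0) {
        struct uffdio_copy cp;

        memset(&cp, 0, sizeof(cp));
        cp.dst = dst;
        cp.src = src;
        cp.len = len;
        cp.mode = 0;            /* Wake the faulting thread as usual */

        if (ioctl(uffd, UFFDIO_COPY, &cp) == 0)
            return 0;           /* Entire range was copied */

        /* Anything other than a short copy with reported progress is
           treated as a hard error; errno is left as set by ioctl(2). */
        if (errno != EAGAIN || cp.copy <= 0)
            return -1;

        /* Partial copy: skip the bytes already copied and retry. */
        dst += (__u64)cp.copy;
        src += (__u64)cp.copy;
        len -= (__u64)cp.copy;
    }
    return 0;
}
```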

>>        If  the  value returned in copy doesn't match the value that was speci‐
>>        fied in len, the operation fails with the error EAGAIN.  The copy field
>>        is output-only; it is not read by the UFFDIO_COPY operation.
>>
>>        This ioctl(2) operation returns 0 on success.  In this case, the entire
>>        area was copied.  On error, -1 is returned and errno is set to indicate
>>        the cause of the error.  Possible errors include:
>>
>>        EAGAIN The number of bytes copied (i.e., the value returned in the copy
>>               field) does not equal the value that was specified  in  the  len
>>               field.
>>
>>        EINVAL Either dst or len was not a multiple of the system page size, or
>>               the range specified by src and len or dst and len was invalid.
>>
>>        EINVAL An invalid bit was specified in the mode field.
>>
>>    UFFDIO_ZEROPAGE
>>        (Since Linux 4.3.)  Zero out  a  memory  range  registered  with  user‐
>>        faultfd.   The  requested  range is specified by the range field of the
>>        uffdio_zeropage structure pointed to by argp:
>>
>>            struct uffdio_zeropage {
>>                struct uffdio_range range;
>>                __u64 mode;     /* Flags controlling behavior of copy */
>>                __s64 zeropage; /* Number of bytes zeroed, or negated error */
>>            };
>>
>>        The following value may be bitwise ORed in mode to change the  behavior
>>        of the UFFDIO_ZEROPAGE operation:
>>
>>        UFFDIO_ZEROPAGE_MODE_DONTWAKE
>>               Do not wake up the thread that waits for page-fault resolution.
>>
>>        The  zeropage field is used by the kernel to return the number of bytes
>>        that was actually zeroed, or an  error  in  the  same  manner  as  UFF‐
>>        DIO_COPY.
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Why  is  the  'zeropage'  field used to return error │
>>        │values?  This should  be  explained  in  the  manual │
>>        │page.                                                │
>>        └─────────────────────────────────────────────────────┘

Help is still needed for this FIXME!

>>        If  the  value  returned  in the zeropage field doesn't match the value
>>        that was specified in range.len, the operation  fails  with  the  error
>>        EAGAIN.   The zeropage field is output-only; it is not read by the UFF‐
>>        DIO_ZEROPAGE operation.
>>
>>        This ioctl(2) operation returns 0 on success.  In this case, the entire
>>        area was zeroed.  On error, -1 is returned and errno is set to indicate
>>        the cause of the error.  Possible errors include:
>>
>>        EAGAIN The number of bytes zeroed (i.e.,  the  value  returned  in  the
>>               zeropage  field)  does not equal the value that was specified in
>>               the range.len field.
>>
>>        EINVAL Either range.start or range.len was not a multiple of the system
>>               page  size;  or  range.len  was zero; or the range specified was
>>               invalid.
>>
>>        EINVAL An invalid bit was specified in the mode field.
>>
>>    UFFDIO_WAKE
>>        (Since Linux 4.3.)  Wake up the thread waiting for  page-fault  resolu‐
>>        tion  on  a  specified  memory  address  range.  The argp argument is a
>>        pointer to a uffdio_range structure (shown above)  that  specifies  the
>>        address range.
>>
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │Need more detail here. How is the UFFDIO_WAKE opera‐ │
>>        │tion used?                                           │
>>        └─────────────────────────────────────────────────────┘
> 
> The UFFDIO_WAKE operation is used in conjunction with
> UFFDIO_{COPY,ZEROPAGE} operations that have
> UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
> The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
> batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.

Perfect! I've tweaked that text a little and added to the page.
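
The batching pattern Mike describes could look roughly like this. It is a sketch only: the helper name is hypothetical, it uses UFFDIO_ZEROPAGE for the batch (UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE would follow the same shape), and it assumes base and page_size are page-aligned values inside a registered range:

```c
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: zero-fill npages pages starting at base without
   waking the faulting thread after each page, then wake it once for the
   whole range. Returns 0 on success, or -1 with errno set. */
static int uffd_zero_batch_then_wake(int uffd, __u64 base, __u64 page_size,
                                     unsigned int npages)
{
    struct uffdio_range wake;
    unsigned int i;

    for (i = 0; i < npages; i++) {
        struct uffdio_zeropage zp;

        memset(&zp, 0, sizeof(zp));
        zp.range.start = base + (__u64)i * page_size;
        zp.range.len = page_size;
        zp.mode = UFFDIO_ZEROPAGE_MODE_DONTWAKE;    /* Defer the wakeup */

        if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) == -1)
            return -1;
    }

    /* One explicit wakeup covering everything resolved above. */
    wake.start = base;
    wake.len = (__u64)npages * page_size;
    return ioctl(uffd, UFFDIO_WAKE, &wake);
}
```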

>>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
>>        and  errno  is set to indicate the cause of the error.  Possible errors
>>        include:
>>
>>        EINVAL The start or the len field of the uffdio_range structure was not
>>               a  multiple  of  the  system  page size; or len was zero; or the
>>               specified range was otherwise invalid.
>>
>> RETURN VALUE
>>        See descriptions of the individual operations, above.
>>
>> ERRORS
>>        See descriptions of the individual operations, above.  In addition, the
>>        following  general errors can occur for all of the operations described
>>        above:
>>
>>        EFAULT argp does not point to a valid memory address.
>>
>>        EINVAL (For all operations except UFFDIO_API.)  The userfaultfd  object
>>               has not yet been enabled (via the UFFDIO_API operation).
>>
>> CONFORMING TO
>>        These ioctl(2) operations are Linux-specific.
>>
>> EXAMPLE
>>        See userfaultfd(2).
>>
>> SEE ALSO
>>        ioctl(2), mmap(2), userfaultfd(2)
>>
>>        Documentation/vm/userfaultfd.txt in the Linux kernel source tree
>>
> 
> [1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680

The current version of the two pages has been pushed to 
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: Review request: draft userfaultfd(2) manual page
  2017-04-21  6:30   ` Michael Kerrisk (man-pages)
@ 2017-04-21 11:06     ` Mike Rapoport
  2017-04-21 11:30       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2017-04-21 11:06 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: Andrea Arcangeli, lkml, linux-mm, linux-man

On Fri, Apr 21, 2017 at 08:30:55AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Mike,
> 
> On 03/21/2017 03:01 PM, Mike Rapoport wrote:
> > Hello Michael,
> > 
> > On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
> >> Hello Andrea, Mike, and all,
> >>
> >> Mike: thanks for the page that you sent. I've reworked it
> >> a bit, and also added a lot of further information,
> >> and an example program. In the process, I split the page
> >> into two pieces, with one piece describing the userfaultfd()
> >> system call and the other describing the ioctl() operations.
> >>
> >> I'd like to get review input, especially from you and
> >> Andrea, but also anyone else, for the current version
> >> of this page, which includes a few FIXMEs to be sorted.
> > 
> > Thanks for the update. I'm addressing the FIXME points you've mentioned
> > below.
> 
> Thanks!
> 
> > Otherwise, everything seems to be the right description of the current upstream.
> > 4.11 will have quite a few updates to userfault and we'll need to update
> > this page and ioctl_userfaultfd(2) to address those updates. I am planning
> > to work on the man update in the next few weeks. 
> >  
> >> I've shown the rendered version of the page below. 
> >> The groff source is attached, and can also be found
> >> at the branch here:
> >  
> >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> >>
> >> The new ioctl_userfaultfd(2) page follows this mail.
> >>
> >> Cheers,
> >>
> >> Michael
> >  
> > --
> > Sincerely yours,
> > Mike. 
> >  
> > 
> >> USERFAULTFD(2)         Linux Programmer's Manual        USERFAULTFD(2)
> >>
> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME                                                │
> >> ├─────────────────────────────────────────────────────┤
> >> │Need  to describe close(2) semantics for userfaultfd │
> >> │file descriptor: what happens when  the  userfaultfd │
> >> │FD is closed?                                        │
> >> │                                                     │
> >> └─────────────────────────────────────────────────────┘
> >  
> > When userfaultfd is closed, it unregisters all memory ranges that were
> > previously registered with it and flushes the outstanding page fault
> > events.
> 
> Presumably, this is more precisely stated as, "when the last
> file descriptor referring to a userfaultfd object is closed..."?

You are right.
 
> I've made the text:
> 
>        When the last file descriptor referring to a userfaultfd object
>        is  closed,  all  memory  ranges  that were registered with the
>        object  are  unregistered  and  unread  page-fault  events  are
>        flushed.
> 
> [...]

Perfect.
 
> >>    Reading from the userfaultfd structure
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │are the details below correct?                       │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Yes, at least for the current upstream version. 4.11 will have quite a few
> > updates to userfaultfd.
> 
> Okay.
> 
> >>        Each read(2) from the userfaultfd file descriptor  returns  one
> >>        or  more  uffd_msg  structures, each of which describes a page-
> >>        fault event:
> >>
> >>            struct uffd_msg {
> >>                __u8  event;                /* Type of event */
> >>                ...
> >>                union {
> >>                    struct {
> >>                        __u64 flags;        /* Flags describing fault */
> >>                        __u64 address;      /* Faulting address */
> >>                    } pagefault;
> >>                    ...
> >>                } arg;
> >>
> >>                /* Padding fields omitted */
> >>            } __packed;
> >>
> >>        If multiple events are available and  the  supplied  buffer  is
> >>        large enough, read(2) returns as many events as will fit in the
> >>        supplied buffer.  If the buffer supplied to read(2) is  smaller
> >>        than the size of the uffd_msg structure, the read(2) fails with
> >>        the error EINVAL.
> >>
> >>        The fields set in the uffd_msg structure are as follows:
> >>
> >>        event  The type of event.  Currently, only one value can appear
> >>               in  this  field: UFFD_EVENT_PAGEFAULT, which indicates a
> >>               page-fault event.
> >>
> >>        address
> >>               The address that triggered the page fault.
> >>
> >>        flags  A bit mask  of  flags  that  describe  the  event.   For
> >>               UFFD_EVENT_PAGEFAULT, the following flag may appear:
> >>
> >>               UFFD_PAGEFAULT_FLAG_WRITE
> >>                      If  the address is in a range that was registered
> >>                      with the UFFDIO_REGISTER_MODE_MISSING  flag  (see
> >>                      ioctl_userfaultfd(2))  and this flag is set, this
> >>                      is a write fault; otherwise it is a read fault.
> >>
> >>        A read(2) on a userfaultfd file descriptor can  fail  with  the
> >>        following errors:
> >>
> >>        EINVAL The  userfaultfd  object  has not yet been enabled using
> >>               the UFFDIO_API ioctl(2) operation.
> >>
> >>        The userfaultfd file descriptor can be monitored with  poll(2),
> >>        select(2),  and  epoll(7).  When events are available, the file
> >>        descriptor indicates as readable.
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │But, it seems,  the  object  must  be  created  with │
> >>        │O_NONBLOCK.  What is the rationale for this require‐ │
> >>        │ment? Something needs to  be  said  in  this  manual │
> >>        │page.                                                │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > The object can be created without O_NONBLOCK, so probably the above
> > sentence can be rephrased as:
> > 
> > When the userfaultfd file descriptor is opened in non-blocking mode, it can
> > be monitored with ...
> 
> Yes, but why is there this requirement for poll() etc. with the
> O_NONBLOCK flag? I think something about that needs to be said in the 
> man page. Sorry, my FIXME was not clear enough. I've reworded the text 
> and the FIXME:
> 
>        If the O_NONBLOCK flag is enabled in the associated  open  file
>        description,  the  userfaultfd file descriptor can be monitored
>        with poll(2), select(2), and epoll(7).  When events are  avail‐
>        able, the file descriptor indicates as readable.  If the O_NON‐
>        BLOCK flag is not enabled, then poll(2) (always) indicates  the
>        file as having a POLLERR condition, and select(2) indicates the
>        file descriptor as both readable and writable.
> 
>        ┌─────────────────────────────────────────────────────┐
>        │FIXME                                                │
>        ├─────────────────────────────────────────────────────┤
>        │What is the reason for this seemingly  odd  behavior │
>        │with  respect  to  the  O_NONBLOCK  flag? (see user‐ │
>        │faultfd_poll()  in   fs/userfaultfd.c).    Something │
>        │needs to be said about this.                         │
>        └─────────────────────────────────────────────────────┘

Andrea, can you please help with this one as well?
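
For reference while this is being sorted out, the poll(2) and read(2) behavior described in the draft can be exercised with a sketch like the one below. The helper name uffd_wait_event() is hypothetical, the choice of EIO for unexpected conditions is arbitrary, and the sketch assumes the descriptor was opened with O_NONBLOCK so that, per the draft text, poll(2) reports it as readable rather than flagging POLLERR:

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for an event
   and read exactly one uffd_msg into *msg. Returns 1 if an event was
   read, 0 on timeout, or -1 on error with errno set. */
static int uffd_wait_event(int fd, int timeout_ms, struct uffd_msg *msg)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t n;
    int ready;

    ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;           /* 0: timeout; -1: poll(2) failed */

    if (!(pfd.revents & POLLIN)) {
        errno = EIO;            /* For example, POLLERR was reported */
        return -1;
    }

    n = read(fd, msg, sizeof(*msg));
    if (n != (ssize_t)sizeof(*msg)) {
        if (n >= 0)
            errno = EIO;        /* Short read: not a whole uffd_msg */
        return -1;
    }
    return 1;
}
```

On return value 1, a caller would check that msg->event is UFFD_EVENT_PAGEFAULT and then use msg->arg.pagefault.address and msg->arg.pagefault.flags.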

> [...]
> 
> Thanks,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Sincerely yours,
Mike.


* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-04-21  9:11     ` Michael Kerrisk (man-pages)
@ 2017-04-21 11:07       ` Mike Rapoport
  2017-04-21 11:41         ` Michael Kerrisk (man-pages)
  2017-05-03 21:46       ` Andrea Arcangeli
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2017-04-21 11:07 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: Andrea Arcangeli, lkml, linux-mm, linux-man

Hello Michael,

On Fri, Apr 21, 2017 at 11:11:18AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Mike,
> Hello Andrea (we need your help!),
> 
> On 03/22/2017 02:54 PM, Mike Rapoport wrote:
> > Hello Michael,
> > 
> > On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
> >> Hello Andrea, Mike, and all,
> >>
> >> Mike: here's the split out page that describes the 
> >> userfaultfd ioctl() operations.
> >>
> >> I'd like to get review input, especially from you and
> >> Andrea, but also anyone else, for the current version
> >> of this page, which includes quite a few FIXMEs to be
> >> sorted.
> >>
> >> I've shown the rendered version of the page below. 
> >> The groff source is attached, and can also be found
> >> at the branch here:
> >>
> >> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> >>
> >> The new ioctl_userfaultfd(2) page follows this mail.
> >>
> >> Cheers,
> >>
> >> Michael
> >>
> >> NAME
> >>        userfaultfd - create a file descriptor for handling page faults in user
> >>        space
> >>
> >> SYNOPSIS
> >>        #include <sys/ioctl.h>
> >>
> >>        int ioctl(int fd, int cmd, ...);
> >>
> >> DESCRIPTION
> >>        Various ioctl(2) operations can be performed on  a  userfaultfd  object
> >>        (created by a call to userfaultfd(2)) using calls of the form:
> >>
> >>            ioctl(fd, cmd, argp);
> >>
> >>        In  the  above,  fd  is  a  file  descriptor referring to a userfaultfd
> >>        object, cmd is one of the commands listed below, and argp is a  pointer
> >>        to a data structure that is specific to cmd.
> >>
> >>        The  various  ioctl(2) operations are described below.  The UFFDIO_API,
> >>        UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
> >>        userfaultfd behavior.  These operations allow the caller to choose what
> >>        features will be enabled and what kinds of events will be delivered  to
> >>        the application.  The remaining operations are range operations.  These
> >>        operations enable the calling application to resolve page-fault  events
> >>        in a consistent way.
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Above: What does "consistent" mean?                  │
> >>        │                                                     │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Andrea, can you please help with this one?
> 
> Let's see what Andrea has to say.

Actually, I thought I'd copied this text from Andrea's docs, but now I've
found out it was my wording and I really don't remember now what was my
intention for "consistent" :)
My guess is that I was thinking about atomicity of UFFDIO_COPY, or the fact
that from the faulting thread perspective the page fault handling is the
same whether it's done in kernel or via userfaultfd...
That said, maybe it'd be better just to drop "in a consistent way".

 
> >>    UFFDIO_API
> >>        (Since Linux 4.3.)  Enable operation of the userfaultfd and perform API
> >>        handshake.  The argp argument is a pointer to a  uffdio_api  structure,
> >>        defined as:
> >>
> >>            struct uffdio_api {
> >>                __u64 api;        /* Requested API version (input) */
> >>                __u64 features;   /* Must be zero */
> >>                __u64 ioctls;     /* Available ioctl() operations (output) */
> >>            };
> >>
> >>        The  api  field  denotes  the API version requested by the application.
> >>        Before the call, the features field must be initialized to zero.
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Above: Why must the 'features' field be  initialized │
> >>        │to zero?                                             │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Until 4.11 the only supported feature is delegation of missing page fault
> > and the UFFDIO_FEATURES bitmask is 0.
> 
> So, the thing that was not clear, but now I think I understand:
> 'features' is an input field where one can ask about supported features
> (but none are supported, before Linux 4.11). Is that correct?

Yes.

> I've changed the text here to read:
> 
>        Before the call, the features field must be  initialized
>        to  zero.  In the future, it is intended that this field can be
>        used to ask whether particular features are supported.
> 
> Seem okay?

Yes.
Just the future is only a week or two from today as we are at 4.11-rc7 :)
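
As a concrete illustration of the handshake as currently drafted (features left at zero), a caller could do something like the following. This is a sketch, not the final word on the 4.11 negotiation; both helper names are hypothetical, and uffd is assumed to be a descriptor returned by userfaultfd(2):

```c
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: nonzero if the bit for ioctl number nr (for
   example, _UFFDIO_REGISTER) is set in a returned ioctls bit mask. */
static int uffd_ioctl_supported(__u64 ioctls, unsigned int nr)
{
    return (ioctls & ((__u64)1 << nr)) != 0;
}

/* Hypothetical helper: perform the UFFDIO_API handshake, requesting API
   version UFFD_API and no features. On success, returns 0 and leaves the
   kernel's answer (features and ioctls) in *api. On error, returns -1
   with errno set by ioctl(2). */
static int uffd_api_handshake(int uffd, struct uffdio_api *api)
{
    memset(api, 0, sizeof(*api));
    api->api = UFFD_API;        /* Requested API version (input) */
    api->features = 0;          /* Must be zero before Linux 4.11 */

    if (ioctl(uffd, UFFDIO_API, api) == -1)
        return -1;
    return 0;
}
```

After a successful handshake, uffd_ioctl_supported(api.ioctls, _UFFDIO_REGISTER) tells the caller whether UFFDIO_REGISTER is available.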

> > There's a check in uffdio_api call that the user is not trying to enable
> > > any other functionality and it asserts that uffdio_api.features is zero [1].
> > Starting from 4.11 the features negotiation is different. Now uffdio_call
> > verifies that it can support features the application requested [2].
> 
> Okay.
> 
> >>        The  kernel verifies that it can support the requested API version, and
> >>        sets the features and ioctls fields to bit masks representing  all  the
> >>        available features and the generic ioctl(2) operations available.  Cur‐
> >>        rently, zero (i.e., no feature bits) is placed in the  features  field.
> >>        The returned ioctls field can contain the following bits:
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │This  user-space  API  seems not fully polished. Why │
> >>        │are there not constants defined for each of the bit- │
> >>        │mask values listed below?                            │
> >>        └─────────────────────────────────────────────────────┘
> >>
> >>        1 << _UFFDIO_API
> >>               The UFFDIO_API operation is supported.
> >>
> >>        1 << _UFFDIO_REGISTER
> >>               The UFFDIO_REGISTER operation is supported.
> >>
> >>        1 << _UFFDIO_UNREGISTER
> >>               The UFFDIO_UNREGISTER operation is supported.
> > 
> > Well, I tend to agree. I believe the original intention was to use the
> > OR'ed mask, like UFFD_API_IOCTLS.
> > > Andrea, can you add something?
> 
> Yes, Andrea, please!
> 
> >>
> >>
> >>               ┌─────────────────────────────────────────────────────┐
> >>               │FIXME                                                │
> >>               ├─────────────────────────────────────────────────────┤
> >>               │Is  the above description of the 'ioctls' field cor‐ │
> >>               │rect?  Does more need to be said?                    │
> >>               │                                                     │
> >>               └─────────────────────────────────────────────────────┘
> > 
> > This is correct. I wouldn't add anything else.
> 
> Thanks.
> 
> >>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
> >>        and  errno  is set to indicate the cause of the error.  Possible errors
> >>        include:
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Is the following error list correct?                 │
> >>        │                                                     │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > There's also -EFAULT in case copy_{from,to}_user fails.
> 
> Okay -- I have added that error.
> 
> >>
> >>        EINVAL The userfaultfd has already been  enabled  by  a  previous  UFF‐
> >>               DIO_API operation.
> >>
> >>        EINVAL The  API  version requested in the api field is not supported by
> >>               this kernel, or the features field was not zero.
> >>
> >>               ┌─────────────────────────────────────────────────────┐
> >>               │FIXME                                                │
> >>               ├─────────────────────────────────────────────────────┤
> >>               │In the above error case, the  returned  'uffdio_api' │
> >>               │structure is zeroed out. Why is this done? This      │
> >>               │should be explained in the manual page.              │
> >>               │                                                     │
> >>               └─────────────────────────────────────────────────────┘
> >  
> > In my understanding the uffdio_api structure is zeroed to allow the caller
> > to distinguish the reasons for -EINVAL.
> 
> Andrea, can you please help here?
> 
> 
> >>    UFFDIO_REGISTER
> >>        (Since Linux 4.3.)  Register a memory  address  range  with  the  user‐
> >>        faultfd  object.   The  argp argument is a pointer to a uffdio_register
> >>        structure, defined as:
> >>
> >>            struct uffdio_range {
> >>                __u64 start;    /* Start of range */
> >>                __u64 len;      /* Length of range (bytes) */
> >>            };
> >>
> >>            struct uffdio_register {
> >>                struct uffdio_range range;
> >>                __u64 mode;     /* Desired mode of operation (input) */
> >>                __u64 ioctls;   /* Available ioctl() operations (output) */
> >>            };
> >>
> >>
> >>        The range field defines a memory range starting at start and continuing
> >>        for len bytes that should be handled by the userfaultfd.
> >>
> >>        The  mode  field  defines the mode of operation desired for this memory
> >>        region.  The following values may be bitwise  ORed  to  set  the  user‐
> >>        faultfd mode for the specified range:
> >>
> >>        UFFDIO_REGISTER_MODE_MISSING
> >>               Track page faults on missing pages.
> >>
> >>        UFFDIO_REGISTER_MODE_WP
> >>               Track page faults on write-protected pages.
> >>
> >>        Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
> >>
> >>        If the operation is successful, the kernel modifies the ioctls bit-mask
> >>        field to indicate which ioctl(2) operations are available for the spec‐
> >>        ified range.  This returned bit mask is as for UFFDIO_API.
> >>
> >>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
> >>        and errno is set to indicate the cause of the error.   Possible  errors
> >>        include:
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Is the following error list correct?                 │
> >>        │                                                     │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Here again it may be -EFAULT to indicate copy_{from,to}_user failure.
> > And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
> > mm_struct has gone by the time userfault grabs it. 
> 
> Okay -- added EFAULT. I think I'll skip ENOMEM for the moment, but
> will note the possibility in the page source.
> 
> >>        EBUSY  A  mapping  in  the  specified  range is registered with another
> >>               userfaultfd object.
> >>
> >>        EINVAL An invalid or unsupported bit was specified in the  mode  field;
> >>               or the mode field was zero.
> >>
> >>        EINVAL There is no mapping in the specified address range.
> >>
> >>        EINVAL range.start  or  range.len  is not a multiple of the system page
> >>               size; or, range.len is  zero;  or  these  fields  are  otherwise
> >>               invalid.
> >>
> >>        EINVAL There was an incompatible mapping in the specified address range.
> >>
> >>
> >>               ┌─────────────────────────────────────────────────────┐
> >>               │FIXME                                                │
> >>               ├─────────────────────────────────────────────────────┤
> >>               │Above: What does "incompatible" mean?                │
> >>               │                                                     │
> >>               └─────────────────────────────────────────────────────┘
> > 
> > Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
> > MAP_PRIVATE mappings.
> 
> Hmmm -- this restriction is not actually mentioned in the description
> of UFFDIO_REGISTER. So, at the start of the description of that operation, 
> I've made the text as follows:
> 
> [[
> .SS UFFDIO_REGISTER
> (Since Linux 4.3.)
> Register a memory address range with the userfaultfd object.
> The pages in the range must be "compatible".
> In the current implementation,
> .\" According to Mike Rapoport, this will change in Linux 4.11.
> only private anonymous ranges are compatible for registering with
> .BR UFFDIO_REGISTER .
> ]]
> 
> Okay?

Yes.
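
For illustration only (this is not text from the draft page), a minimal sketch
of the registration step discussed above. The helper names uffd_range_ok() and
uffd_register_missing() are hypothetical; the sketch assumes the range was
obtained from mmap(2) with MAP_PRIVATE | MAP_ANONYMOUS, since those are the
only "compatible" mappings up to Linux 4.10, and that UFFDIO_API has already
succeeded on the descriptor.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Mirrors the EINVAL conditions quoted above: start and len must be
   multiples of the page size, and len must be non-zero. */
int uffd_range_ok(uint64_t start, uint64_t len, uint64_t page_size)
{
    return page_size != 0 && len != 0 &&
           start % page_size == 0 && len % page_size == 0;
}

/* Hypothetical helper: register [start, start + len) for missing-page
   faults.  On success returns 0 and stores the bit mask of available
   ioctl() operations in *ioctls; on failure returns -1 with errno set
   by ioctl(2). */
int uffd_register_missing(int uffd, uint64_t start, uint64_t len,
                          uint64_t *ioctls)
{
    struct uffdio_register reg = {
        .range = { .start = start, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -1;

    *ioctls = reg.ioctls;   /* output field filled in by the kernel */
    return 0;
}
```

Checking the range in user space first is optional; the kernel performs the
same validation and reports EINVAL.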
 
> >>    UFFDIO_UNREGISTER
> >>        (Since Linux 4.3.)  Unregister a memory address range from userfaultfd.
> >>        The address range to unregister is specified in the uffdio_range struc‐
> >>        ture pointed to by argp.
> >>
> >>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
> >>        and errno is set to indicate the cause of the error.   Possible  errors
> >>        include:
> >>
> >>        EINVAL Either  the start or the len field of the uffdio_range structure
> >>               was not a multiple of the system page size; or the len field was
> >>               zero; or these fields were otherwise invalid.
> >>
> >>        EINVAL There was an incompatible mapping in the specified address range.
> >>
> >>
> >>               ┌─────────────────────────────────────────────────────┐
> >>               │FIXME                                                │
> >>               ├─────────────────────────────────────────────────────┤
> >>               │Above: What does "incompatible" mean?                │
> >>               └─────────────────────────────────────────────────────┘
> > 
> > The same comments as for UFFDIO_REGISTER apply here as well.
> 
> Okay. I changed the introductory text on UFFDIO_UNREGISTER to say:
> 
> [[
> .SS UFFDIO_UNREGISTER
> (Since Linux 4.3.)
> Unregister a memory address range from userfaultfd.
> The pages in the range must be "compatible" (see the description of
> .BR  UFFDIO_REGISTER .)
> ]]
> 
> Okay?

Yes.

> >>        EINVAL There was no mapping in the specified address range.
> >>
> >>    UFFDIO_COPY
> >>        (Since  Linux 4.3.)  Atomically copy a continuous memory chunk into the
> >>        userfault registered range and optionally wake up the  blocked  thread.
> >>        The  source  and  destination addresses and the number of bytes to copy
> >>        are specified by the src, dst, and len fields of the uffdio_copy struc‐
> >>        ture pointed to by argp:
> >>
> >>            struct uffdio_copy {
> >>                __u64 dst;    /* Destination of copy */
> >>                __u64 src;    /* Source of copy */
> >>                __u64 len;    /* Number of bytes to copy */
> >>                __u64 mode;   /* Flags controlling behavior of copy */
> >>                __s64 copy;   /* Number of bytes copied, or negated error */
> >>            };
> >>
> >>        The  following value may be bitwise ORed in mode to change the behavior
> >>        of the UFFDIO_COPY operation:
> >>
> >>        UFFDIO_COPY_MODE_DONTWAKE
> >>               Do not wake up the thread that waits for page-fault resolution.
> >>
> >>        The copy field is used by the kernel to return the number of bytes that
> >>        was actually copied, or an error (a negated errno-style value).
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Above:  Why is the 'copy' field used to return error │
> >>        │values?  This should  be  explained  in  the  manual │
> >>        │page.                                                │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Andrea, can you help with this one, please?
> 
> Yes, Andrea, please.
> 
> >>        If  the  value returned in copy doesn't match the value that was speci‐
> >>        fied in len, the operation fails with the error EAGAIN.  The copy field
> >>        is output-only; it is not read by the UFFDIO_COPY operation.
> >>
> >>        This ioctl(2) operation returns 0 on success.  In this case, the entire
> >>        area was copied.  On error, -1 is returned and errno is set to indicate
> >>        the cause of the error.  Possible errors include:
> >>
> >>        EAGAIN The number of bytes copied (i.e., the value returned in the copy
> >>               field) does not equal the value that was specified  in  the  len
> >>               field.
> >>
> >>        EINVAL Either dst or len was not a multiple of the system page size, or
> >>               the range specified by src and len or dst and len was invalid.
> >>
> >>        EINVAL An invalid bit was specified in the mode field.
> >>
> >>    UFFDIO_ZEROPAGE
> >>        (Since Linux 4.3.)  Zero out  a  memory  range  registered  with  user‐
> >>        faultfd.   The  requested  range is specified by the range field of the
> >>        uffdio_zeropage structure pointed to by argp:
> >>
> >>            struct uffdio_zeropage {
> >>                struct uffdio_range range;
> >>                __u64 mode;     /* Flags controlling behavior of zeroing */
> >>                __s64 zeropage; /* Number of bytes zeroed, or negated error */
> >>            };
> >>
> >>        The following value may be bitwise ORed in mode to change the  behavior
> >>        of the UFFDIO_ZEROPAGE operation:
> >>
> >>        UFFDIO_ZEROPAGE_MODE_DONTWAKE
> >>               Do not wake up the thread that waits for page-fault resolution.
> >>
> >>        The  zeropage field is used by the kernel to return the number of bytes
> >>        that was actually zeroed, or an  error  in  the  same  manner  as  UFF‐
> >>        DIO_COPY.
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Why  is  the  'zeropage'  field used to return error │
> >>        │values?  This should  be  explained  in  the  manual │
> >>        │page.                                                │
> >>        └─────────────────────────────────────────────────────┘
> 
> Help is still needed for this FIXME!

It would be pretty much the same as for the 'copy' field in uffdio_copy...
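
To make the role of the 'copy' field concrete, here is a minimal sketch (an
illustration, not text from the draft page). The helper name uffd_copy_once()
is hypothetical. It follows the description quoted above: on success the whole
area was copied, and on EAGAIN the copy field holds the number of bytes that
were actually copied. The sketch zeroes the field before the call so that a
stale value is never misread when the request fails for another reason.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: issue one UFFDIO_COPY request.
   Returns the number of bytes copied (len on full success, fewer if the
   operation failed with EAGAIN after a partial copy), or -1 with errno
   set for any other failure. */
long long uffd_copy_once(int uffd, uint64_t dst, uint64_t src, uint64_t len)
{
    struct uffdio_copy c = {
        .dst  = dst,
        .src  = src,
        .len  = len,
        .mode = 0,      /* wake the faulting thread when done */
        .copy = 0,      /* output-only; cleared so it is never stale */
    };

    if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
        return (long long) c.copy;      /* entire area was copied */

    if (errno == EAGAIN && c.copy > 0)
        return (long long) c.copy;      /* partial copy; caller may retry */

    return -1;
}
```

A caller that gets back fewer bytes than requested can advance dst and src by
the returned amount and retry with the remaining length.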
 
> >>        If  the  value  returned  in the zeropage field doesn't match the value
> >>        that was specified in range.len, the operation  fails  with  the  error
> >>        EAGAIN.   The zeropage field is output-only; it is not read by the
> >>        UFFDIO_ZEROPAGE operation.
> >>
> >>        This ioctl(2) operation returns 0 on success.  In this case, the entire
> >>        area was zeroed.  On error, -1 is returned and errno is set to indicate
> >>        the cause of the error.  Possible errors include:
> >>
> >>        EAGAIN The number of bytes zeroed (i.e.,  the  value  returned  in  the
> >>               zeropage  field)  does not equal the value that was specified in
> >>               the range.len field.
> >>
> >>        EINVAL Either range.start or range.len was not a multiple of the system
> >>               page  size;  or  range.len  was zero; or the range specified was
> >>               invalid.
> >>
> >>        EINVAL An invalid bit was specified in the mode field.
> >>
> >>    UFFDIO_WAKE
> >>        (Since Linux 4.3.)  Wake up the thread waiting for  page-fault  resolu‐
> >>        tion  on  a  specified  memory  address  range.  The argp argument is a
> >>        pointer to a uffdio_range structure (shown above)  that  specifies  the
> >>        address range.
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Need more detail here. How is the UFFDIO_WAKE opera‐ │
> >>        │tion used?                                           │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > The UFFDIO_WAKE operation is used in conjunction with
> > UFFDIO_{COPY,ZEROPAGE} operations that have
> > UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
> > The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
> > batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.
> 
> Perfect! I've tweaked that text a little and added to the page.
> 
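
A minimal sketch of the batching pattern described above (an illustration, not
text from the draft page). The helper name uffd_copy_batch() is hypothetical;
it assumes dst and src are page-aligned, that the destination range is
registered with the userfaultfd object, and that every page copies fully.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: fill npages pages one at a time without waking
   the faulting thread, then wake it once for the whole range.
   Returns 0 on success, or -1 with errno set by the failing ioctl(2). */
int uffd_copy_batch(int uffd, uint64_t dst, uint64_t src,
                    uint64_t page_size, unsigned int npages)
{
    for (unsigned int i = 0; i < npages; i++) {
        struct uffdio_copy c = {
            .dst  = dst + i * page_size,
            .src  = src + i * page_size,
            .len  = page_size,
            .mode = UFFDIO_COPY_MODE_DONTWAKE,  /* defer the wake-up */
        };

        if (ioctl(uffd, UFFDIO_COPY, &c) == -1)
            return -1;
    }

    /* One explicit wake-up covering everything copied above. */
    struct uffdio_range r = { .start = dst, .len = page_size * npages };

    return ioctl(uffd, UFFDIO_WAKE, &r);
}
```

The same shape applies to UFFDIO_ZEROPAGE with UFFDIO_ZEROPAGE_MODE_DONTWAKE.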
> >>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
> >>        and  errno  is set to indicate the cause of the error.  Possible errors
> >>        include:
> >>
> >>        EINVAL The start or the len field of the uffdio_range structure was not
> >>               a  multiple  of  the  system  page size; or len was zero; or the
> >>               specified range was otherwise invalid.
> >>
> >> RETURN VALUE
> >>        See descriptions of the individual operations, above.
> >>
> >> ERRORS
> >>        See descriptions of the individual operations, above.  In addition, the
> >>        following  general errors can occur for all of the operations described
> >>        above:
> >>
> >>        EFAULT argp does not point to a valid memory address.
> >>
> >>        EINVAL (For all operations except UFFDIO_API.)  The userfaultfd  object
> >>               has not yet been enabled (via the UFFDIO_API operation).
> >>
> >> CONFORMING TO
> >>        These ioctl(2) operations are Linux-specific.
> >>
> >> EXAMPLE
> >>        See userfaultfd(2).
> >>
> >> SEE ALSO
> >>        ioctl(2), mmap(2), userfaultfd(2)
> >>
> >>        Documentation/vm/userfaultfd.txt in the Linux kernel source tree
> >>
> > 
> > [1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680
> 
> The current version of the two pages has been pushed to 
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
> 
> Cheers,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Review request: draft userfaultfd(2) manual page
  2017-04-21 11:06     ` Mike Rapoport
@ 2017-04-21 11:30       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-04-21 11:30 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: mtk.manpages, Andrea Arcangeli, lkml, linux-mm, linux-man

Hello Mike,

On 04/21/2017 01:06 PM, Mike Rapoport wrote:
> On Fri, Apr 21, 2017 at 08:30:55AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Mike,
>>
>> On 03/21/2017 03:01 PM, Mike Rapoport wrote:
>>> Hello Michael,
>>>
>>> On Mon, Mar 20, 2017 at 09:08:05PM +0100, Michael Kerrisk (man-pages) wrote:
>>>> Hello Andrea, Mike, and all,
>>>>
>>>> Mike: thanks for the page that you sent. I've reworked it
>>>> a bit, and also added a lot of further information,
>>>> and an example program. In the process, I split the page
>>>> into two pieces, with one piece describing the userfaultfd()
>>>> system call and the other describing the ioctl() operations.
>>>>
>>>> I'd like to get review input, especially from you and
>>>> Andrea, but also anyone else, for the current version
>>>> of this page, which includes a few FIXMEs to be sorted.
>>>
>>> Thanks for the update. I'm addressing the FIXME points you've mentioned
>>> below.
>>
>> Thanks!
>>
>>> Otherwise, everything seems to be an accurate description of the current
>>> upstream. 4.11 will have quite a few updates to userfault and we'll need to update
>>> this page and ioctl_userfaultfd(2) to address those updates. I am planning
>>> to work on the man update in the next few weeks. 
>>>  
>>>> I've shown the rendered version of the page below. 
>>>> The groff source is attached, and can also be found
>>>> at the branch here:
>>>  
>>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>>>
>>>> The new ioctl_userfaultfd(2) page follows this mail.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>  
>>> --
>>> Sincerely yours,
>>> Mike. 
>>>  
>>>
>>>> USERFAULTFD(2)         Linux Programmer's Manual        USERFAULTFD(2)
>>>>
>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME                                                │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Need to  describe close(2) semantics for userfaultfd │
>>>> │file descriptor: what happens when  the  userfaultfd │
>>>> │FD is closed?                                        │
>>>> │                                                     │
>>>> └─────────────────────────────────────────────────────┘
>>>  
>>> When userfaultfd is closed, it unregisters all memory ranges that were
>>> previously registered with it and flushes the outstanding page fault
>>> events.
>>
>> Presumably, this is more precisely stated as, "when the last
>> file descriptor referring to a userfaultfd object is closed..."?
> 
> You are right.

Thanks for the confirmation.

>> I've made the text:
>>
>>        When the last file descriptor referring to a userfaultfd object
>>        is  closed,  all  memory  ranges  that were registered with the
>>        object  are  unregistered  and  unread  page-fault  events  are
>>        flushed.
>>
>> [...]
> 
> Perfect.
>  

[...]

>>>>        Each read(2) from the userfaultfd file descriptor  returns  one
>>>>        or  more  uffd_msg  structures, each of which describes a page-
>>>>        fault event:
>>>>
>>>>            struct uffd_msg {
>>>>                __u8  event;                /* Type of event */
>>>>                ...
>>>>                union {
>>>>                    struct {
>>>>                        __u64 flags;        /* Flags describing fault */
>>>>                        __u64 address;      /* Faulting address */
>>>>                    } pagefault;
>>>>                    ...
>>>>                } arg;
>>>>
>>>>                /* Padding fields omitted */
>>>>            } __packed;
>>>>
>>>>        If multiple events are available and  the  supplied  buffer  is
>>>>        large enough, read(2) returns as many events as will fit in the
>>>>        supplied buffer.  If the buffer supplied to read(2) is  smaller
>>>>        than the size of the uffd_msg structure, the read(2) fails with
>>>>        the error EINVAL.
>>>>
>>>>        The fields set in the uffd_msg structure are as follows:
>>>>
>>>>        event  The type of event.  Currently, only one value can appear
>>>>               in  this  field: UFFD_EVENT_PAGEFAULT, which indicates a
>>>>               page-fault event.
>>>>
>>>>        address
>>>>               The address that triggered the page fault.
>>>>
>>>>        flags  A bit mask  of  flags  that  describe  the  event.   For
>>>>               UFFD_EVENT_PAGEFAULT, the following flag may appear:
>>>>
>>>>               UFFD_PAGEFAULT_FLAG_WRITE
>>>>                      If  the address is in a range that was registered
>>>>                      with the UFFDIO_REGISTER_MODE_MISSING  flag  (see
>>>>                      ioctl_userfaultfd(2))  and this flag is set, this
>>>>                      is a write fault; otherwise it is a read fault.
>>>>
>>>>        A read(2) on a userfaultfd file descriptor can  fail  with  the
>>>>        following errors:
>>>>
>>>>        EINVAL The  userfaultfd  object  has not yet been enabled using
>>>>               the UFFDIO_API ioctl(2) operation.
>>>>
>>>>        The userfaultfd file descriptor can be monitored with  poll(2),
>>>>        select(2),  and  epoll(7).  When events are available, the file
>>>>        descriptor indicates as readable.
>>>>
>>>>
>>>>        ┌─────────────────────────────────────────────────────┐
>>>>        │FIXME                                                │
>>>>        ├─────────────────────────────────────────────────────┤
>>>>        │But, it seems,  the  object  must  be  created  with │
>>>>        │O_NONBLOCK.  What is the rationale for this require‐ │
>>>>        │ment? Something needs to  be  said  in  this  manual │
>>>>        │page.                                                │
>>>>        └─────────────────────────────────────────────────────┘
>>>
>>> The object can be created without O_NONBLOCK, so probably the above
>>> sentence can be rephrased as:
>>>
>>> When the userfaultfd file descriptor is opened in non-blocking mode, it can
>>> be monitored with ...
>>
>> Yes, but why is there this requirement for poll() etc. with the
>> O_NONBLOCK flag? I think something about that needs to be said in the 
>> man page. Sorry, my FIXME was not clear enough. I've reworded the text 
>> and the FIXME:
>>
>>        If the O_NONBLOCK flag is enabled in the associated  open  file
>>        description,  the  userfaultfd file descriptor can be monitored
>>        with poll(2), select(2), and epoll(7).  When events are  avail‐
>>        able, the file descriptor indicates as readable.  If the O_NON‐
>>        BLOCK flag is not enabled, then poll(2) (always) indicates  the
>>        file as having a POLLERR condition, and select(2) indicates the
>>        file descriptor as both readable and writable.
>>
>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │What is the reason for this seemingly  odd  behavior │
>>        │with  respect  to  the  O_NONBLOCK  flag? (see user‐ │
>>        │faultfd_poll()  in   fs/userfaultfd.c).    Something │
>>        │needs to be said about this.                         │
>>        └─────────────────────────────────────────────────────┘
> 
> Andrea, can you please help with this one as well?

Let's see what Andrea has to say.
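
For illustration (not text from the draft page), a minimal sketch of the
poll(2)-then-read(2) pattern described in the reworded paragraph above. The
helper names uffd_wait_fault() and uffd_fault_is_write() are hypothetical, and
the sketch assumes the descriptor was created with O_NONBLOCK. It treats
POLLERR as a failure, which is what the quoted text says poll(2) reports when
O_NONBLOCK is not set.

```c
#include <poll.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: wait up to timeout_ms for one page-fault event.
   Returns 1 if an event was read into *msg, 0 if nothing arrived in
   time, and -1 on error (including POLLERR or an unexpected event). */
int uffd_wait_fault(int uffd, int timeout_ms, struct uffd_msg *msg)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0)
        return n;                       /* 0: timeout, -1: poll() failed */
    if (pfd.revents & POLLERR)
        return -1;

    ssize_t got = read(uffd, msg, sizeof(*msg));
    if (got != (ssize_t) sizeof(*msg))
        return -1;                      /* short read or read() error */

    return msg->event == UFFD_EVENT_PAGEFAULT ? 1 : -1;
}

/* Nonzero if the event describes a write fault, zero for a read fault. */
int uffd_fault_is_write(const struct uffd_msg *msg)
{
    return (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0;
}
```

The faulting address is then available in msg->arg.pagefault.address.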

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-04-21 11:07       ` Mike Rapoport
@ 2017-04-21 11:41         ` Michael Kerrisk (man-pages)
  2017-04-25  8:00           ` Mike Rapoport
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-04-21 11:41 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: mtk.manpages, Andrea Arcangeli, lkml, linux-mm, linux-man

Hi Mike,

On 04/21/2017 01:07 PM, Mike Rapoport wrote:
> Hello Michael,
> 
> On Fri, Apr 21, 2017 at 11:11:18AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Mike,
>> Hello Andrea (we need your help!),
>>
>> On 03/22/2017 02:54 PM, Mike Rapoport wrote:
>>> Hello Michael,
>>>
>>> On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
>>>> Hello Andrea, Mike, and all,
>>>>
>>>> Mike: here's the split out page that describes the 
>>>> userfaultfd ioctl() operations.
>>>>
>>>> I'd like to get review input, especially from you and
>>>> Andrea, but also anyone else, for the current version
>>>> of this page, which includes quite a few FIXMEs to be
>>>> sorted.
>>>>
>>>> I've shown the rendered version of the page below. 
>>>> The groff source is attached, and can also be found
>>>> at the branch here:
>>>>
>>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>>>>
>>>> The new ioctl_userfaultfd(2) page follows this mail.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> NAME
>>>>        userfaultfd - create a file descriptor for handling page faults in user
>>>>        space
>>>>
>>>> SYNOPSIS
>>>>        #include <sys/ioctl.h>
>>>>
>>>>        int ioctl(int fd, int cmd, ...);
>>>>
>>>> DESCRIPTION
>>>>        Various ioctl(2) operations can be performed on  a  userfaultfd  object
>>>>        (created by a call to userfaultfd(2)) using calls of the form:
>>>>
>>>>            ioctl(fd, cmd, argp);
>>>>
>>>>        In  the  above,  fd  is  a  file  descriptor referring to a userfaultfd
>>>>        object, cmd is one of the commands listed below, and argp is a  pointer
>>>>        to a data structure that is specific to cmd.
>>>>
>>>>        The  various  ioctl(2) operations are described below.  The UFFDIO_API,
>>>>        UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
>>>>        userfaultfd behavior.  These operations allow the caller to choose what
>>>>        features will be enabled and what kinds of events will be delivered  to
>>>>        the application.  The remaining operations are range operations.  These
>>>>        operations enable the calling application to resolve page-fault  events
>>>>        in a consistent way.
>>>>
>>>>
>>>>        ┌─────────────────────────────────────────────────────┐
>>>>        │FIXME                                                │
>>>>        ├─────────────────────────────────────────────────────┤
>>>>        │Above: What does "consistent" mean?                  │
>>>>        │                                                     │
>>>>        └─────────────────────────────────────────────────────┘
>>>
>>> Andrea, can you please help with this one?
>>
>> Let's see what Andrea has to say.
> 
> Actually, I thought I'd copied this text from Andrea's docs, but now I've
> found out it was my wording and I really don't remember now what was my
> intention for "consistent" :)
> My guess is that I was thinking about atomicity of UFFDIO_COPY, or the fact
> that from the faulting thread perspective the page fault handling is the
> same whether it's done in kernel or via userfaultfd...
> That said, maybe it'd be better just to drop "in a consistent way".

Okay. Dropped.

>>>>    UFFDIO_API
>>>>        (Since Linux 4.3.)  Enable operation of the userfaultfd and perform API
>>>>        handshake.  The argp argument is a pointer to a  uffdio_api  structure,
>>>>        defined as:
>>>>
>>>>            struct uffdio_api {
>>>>                __u64 api;        /* Requested API version (input) */
>>>>                __u64 features;   /* Must be zero */
>>>>                __u64 ioctls;     /* Available ioctl() operations (output) */
>>>>            };
>>>>
>>>>        The  api  field  denotes  the API version requested by the application.
>>>>        Before the call, the features field must be initialized to zero.
>>>>
>>>>
>>>>        ┌─────────────────────────────────────────────────────┐
>>>>        │FIXME                                                │
>>>>        ├─────────────────────────────────────────────────────┤
>>>>        │Above: Why must the 'features' field be  initialized │
>>>>        │to zero?                                             │
>>>>        └─────────────────────────────────────────────────────┘
>>>
>>> Until 4.11 the only supported feature is delegation of missing page fault
>>> and the UFFDIO_FEATURES bitmask is 0.
>>
>> So, the thing that was not clear, but now I think I understand:
>> 'features' is an input field where one can ask about supported features
>> (but none are supported, before Linux 4.11). Is that correct?
> 
> Yes.

Thanks.

>> I've changed the text here to read:
>>
>>        Before the call, the features field must be  initialized
>>        to  zero.  In the future, it is intended that this field can be
>>        used to ask whether particular features are supported.
>>
>> Seem okay?
> 
> Yes.
> Just the future is only a week or two from today as we are at 4.11-rc7 :)

Yes, I understand :-). So of course there's a *lot* more
new stuff to document, right?
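
As an illustration of the handshake as documented for kernels before 4.11
(not text from the draft page), a minimal sketch follows. The helper names
uffd_create() and uffd_handshake() are hypothetical. The sketch invokes the
system call through syscall(2) and assumes the installed headers define
SYS_userfaultfd; the features field is left at zero, as the text above
requires.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: create a userfaultfd object.
   Returns a file descriptor, or -1 with errno set. */
int uffd_create(void)
{
    return (int) syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
}

/* Hypothetical helper: perform the UFFDIO_API handshake.
   On success returns 0 and stores the bit mask of available ioctl()
   operations in *ioctls; on failure returns -1 with errno set. */
int uffd_handshake(int uffd, uint64_t *ioctls)
{
    struct uffdio_api api = {
        .api      = UFFD_API,   /* requested API version (input) */
        .features = 0,          /* must be zero before Linux 4.11 */
    };

    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;

    *ioctls = api.ioctls;       /* filled in by the kernel */
    return 0;
}
```

Other operations on the descriptor fail with EINVAL until this handshake has
been done.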

[...]

>>>>    UFFDIO_REGISTER

[...]

>>>>        EINVAL There as an incompatible mapping in the specified address range.
>>>>
>>>>
>>>>               ┌─────────────────────────────────────────────────────┐
>>>>               │FIXME                                                │
>>>>               ├─────────────────────────────────────────────────────┤
>>>>               │Above: What does "incompatible" mean?                │
>>>>               │                                                     │
>>>>               └─────────────────────────────────────────────────────┘
>>>
>>> Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
>>> MAP_PRIVATE mappings.
>>
>> Hmmm -- this restriction is not actually mentioned in the description
>> of UFFDIO_REGISTER. So, at the start of the description of that operation, 
>> I've made the text as follows:
>>
>> [[
>> .SS UFFDIO_REGISTER
>> (Since Linux 4.3.)
>> Register a memory address range with the userfaultfd object.
>> The pages in the range must be "compatible".
>> In the current implementation,
>> .\" According to Mike Rapoport, this will change in Linux 4.11.
>> only private anonymous ranges are compatible for registering with
>> .BR UFFDIO_REGISTER .
>> ]]
>>
>> Okay?
> 
> Yes.

Thanks for checking it.

>>>>    UFFDIO_UNREGISTER

[...]

>>>>        EINVAL There as an incompatible mapping in the specified address range.
>>>>
>>>>
>>>>               ┌─────────────────────────────────────────────────────┐
>>>>               │FIXME                                                │
>>>>               ├─────────────────────────────────────────────────────┤
>>>>               │Above: What does "incompatible" mean?                │
>>>>               └─────────────────────────────────────────────────────┘
>>>
>>> The same comments as for UFFDIO_REGISTER apply here as well.
>>
>> Okay. I changed the introductory text on UFFDIO_UNREGISTER to say:
>>
>> [[
>> .SS UFFDIO_UNREGISTER
>> (Since Linux 4.3.)
>> Unregister a memory address range from userfaultfd.
>> The pages in the range must be "compatible" (see the description of
>> .BR  UFFDIO_REGISTER .)
>> ]]
>>
>> Okay?
> 
> Yes.

Thanks.

[...]

The current version of the two pages has been pushed to 
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-04-21 11:41         ` Michael Kerrisk (man-pages)
@ 2017-04-25  8:00           ` Mike Rapoport
  2017-04-25 10:59             ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2017-04-25  8:00 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: Andrea Arcangeli, lkml, linux-mm, linux-man

Hello Michael,

On Fri, Apr 21, 2017 at 01:41:18PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Mike,
> 

[...]

> > 
> > Yes.
> > Just the future is only a week or two from today as we are at 4.11-rc7 :)
> 
> Yes, I understand :-). So of course there's a *lot* more
> new stuff to document, right?

I've started to add the description of the new functionality to both
userfaultfd.2 and ioctl_userfaultfd.2, and it's somewhat difficult for me to
decide how best to split the information between these two pages and what
the pages' internal structure should be.

I even thought about the possibility of adding a relatively comprehensive
description of userfaultfd as man7/userfaultfd.7 and then keeping the pages
in man2 relatively small, with just a brief description of the APIs and a
SEE ALSO pointing to man7.

Any advice is highly appreciated.
 
> [...]

--
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-04-25  8:00           ` Mike Rapoport
@ 2017-04-25 10:59             ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2017-04-25 10:59 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Andrea Arcangeli, lkml, linux-mm, linux-man

Hi Mike,

On 25 April 2017 at 10:00, Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> Hello Michael,
>
> On Fri, Apr 21, 2017 at 01:41:18PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Mike,
>>
>
> [...]
>
>> >
>> > Yes.
>> > Just the future is only a week or two from today as we are at 4.11-rc7 :)
>>
>> Yes, I understand :-). So of course there's a *lot* more
>> new stuff to document, right?
>
> I've started to add the description of the new functionality to both
> userfaultfd.2 and ioctl_userfaultfd.2

Thanks for doing this!

> and it's somewhat difficult for me to
> decide how best to split the information between these two pages and what
> the pages' internal structure should be.
>
> I even thought about the possibility of adding a relatively comprehensive
> description of userfaultfd as man7/userfaultfd.7 and then keeping the pages
> in man2 relatively small, with just a brief description of the APIs and a
> SEE ALSO pointing to man7.
>
> Any advice is highly appreciated.

I'm not averse to the notion of a userfaultfd.7 page, but it's a
little hard to advise because I'm not sure of the size and scope of
your planned changes.

In the meantime, I've merged the userfaultfd pages into master,
dropped the "draft" branch, and pushed the updates in master to Git.

Can you write your changes as a series of patches, and perhaps first
give a brief outline of the proposed changes before getting too far
into the work? Then we could tweak the direction if needed.

Cheers,

Michael

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Review request: draft ioctl_userfaultfd(2) manual page
  2017-04-21  9:11     ` Michael Kerrisk (man-pages)
  2017-04-21 11:07       ` Mike Rapoport
@ 2017-05-03 21:46       ` Andrea Arcangeli
  1 sibling, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2017-05-03 21:46 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages); +Cc: Mike Rapoport, lkml, linux-mm, linux-man

On Fri, Apr 21, 2017 at 11:11:18AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Mike,
> Hello Andrea (we need your help!),

Sorry for not answering sooner! (I had a vacation last week)

> 
> On 03/22/2017 02:54 PM, Mike Rapoport wrote:
> >>        The  various  ioctl(2) operations are described below.  The UFFDIO_API,
> >>        UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
> >>        userfaultfd behavior.  These operations allow the caller to choose what
> >>        features will be enabled and what kinds of events will be delivered  to
> >>        the application.  The remaining operations are range operations.  These
> >>        operations enable the calling application to resolve page-fault  events
> >>        in a consistent way.
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Above: What does "consistent" mean?                  │
> >>        │                                                     │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Andrea, can you please help with this one?
> 
> Let's see what Andrea has to say.

I think it doesn't mean anything and I see you already removed it, fine!

> So, the thing that was not clear, but now I think I understand:
> 'features' is an input field where one can ask about supported features
> (but none are supported, before Linux 4.11). Is that correct? I've changed
> the text here to read:
> 
>        Before the call, the features field must be  initialized
>        to  zero.  In the future, it is intended that this field can be
>        used to ask whether particular features are supported.
> 
> Seem okay?

Yes, but in reality nothing has changed. Simply the kernels before
4.11 had no feature support at all.

===
       Starting from Linux 4.11, the features field can be used to ask
       whether particular features are supported and explicitly enable
       userfaultfd features that are disabled by default.  The kernel
       always reports all the available features in the features
       field.
===

I would prefer if we removed this 4.11 difference here.

We should be able to describe it simply as:

"The features field can be used to ask whether particular features are
supported and explicitly enable userfaultfd features that are disabled
by default. The kernel always reports all the available features in
the features field."

The whole point of this feature flag thing, is so the app at runtime
can check if the feature is available and ask for it. The fact kernels
before 4.11 don't support any feature is a detail.
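
A minimal sketch of that runtime check, assuming the UFFDIO_API handshake as described in this thread. The helper name uffd_negotiate is hypothetical and is not part of any kernel or libc API; the caller is assumed to have obtained uffd from userfaultfd(2) already.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: perform the UFFDIO_API handshake on uffd,
 * requesting the feature bits in `wanted` (0 requests nothing extra).
 * On success returns 0 and stores the kernel-reported feature mask in
 * *available.  On failure returns -errno and never looks at the struct,
 * since its contents should be ignored after an error.
 */
static int uffd_negotiate(int uffd, uint64_t wanted, uint64_t *available)
{
    struct uffdio_api api;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;

    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -errno;

    *available = api.features;
    return 0;
}
```

After a successful call, the application can test individual bits in the returned mask before relying on the corresponding feature.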

> > There's a check in the uffdio_api call that the user is not trying to enable
> > any other functionality, and it asserts that uffdio_api.features is zero [1].
> > Starting from 4.11 the features negotiation is different. Now the uffdio_api
> > call verifies that it can support the features the application requested [2].
> 
> Okay.

I don't like the differentiation here between 4.11 and before, because
from the user's point of view nothing has changed.

I think this description is enough: "Since Linux 4.11, the following
feature bits may be set:", and no other mention of 4.11 is needed in the
manpage. It looks like an unnecessary complication for the reader.

> 
> >>        The  kernel verifies that it can support the requested API version, and
> >>        sets the features and ioctls fields to bit masks representing  all  the
> >>        available features and the generic ioctl(2) operations available.  Cur‐
> >>        rently, zero (i.e., no feature bits) is placed in the  features  field.
> >>        The returned ioctls field can contain the following bits:
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │This  user-space  API  seems not fully polished. Why │
> >>        │are there not constants defined for each of the bit- │
> >>        │mask values listed below?                            │
> >>        └─────────────────────────────────────────────────────┘
> >>
> >>        1 << _UFFDIO_API
> >>               The UFFDIO_API operation is supported.
> >>
> >>        1 << _UFFDIO_REGISTER
> >>               The UFFDIO_REGISTER operation is supported.
> >>
> >>        1 << _UFFDIO_UNREGISTER
> >>               The UFFDIO_UNREGISTER operation is supported.
> > 
> > Well, I tend to agree. I believe the original intention was to use the
> > OR'ed mask, like UFFD_API_IOCTLS.
> > Andrea, can you add something?
> 
> Yes, Andrea, please!

I agree it can be polished, but that's not something the manpage can
fix... For now the above is correct.
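
Until named constants exist, a caller can wrap the shift in a small test. This is only a sketch: uffd_has_ioctl is a hypothetical name, and the _UFFDIO_* numbers and the UFFD_API_IOCTLS mask are taken from <linux/userfaultfd.h>.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: report whether ioctl number `nr` (for example
 * _UFFDIO_REGISTER) is set in a mask returned in uffdio_api.ioctls
 * or uffdio_register.ioctls.
 */
static bool uffd_has_ioctl(uint64_t mask, unsigned int nr)
{
    return (mask & ((uint64_t)1 << nr)) != 0;
}
```

For example, uffd_has_ioctl(api.ioctls, _UFFDIO_REGISTER) tells the caller whether registration is available on this userfaultfd.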

As for the error retvals, I reviewed the final manpage from git, which
is the latest version.


> >>
> >>        EINVAL The userfaultfd has already been  enabled  by  a  previous  UFF‐
> >>               DIO_API operation.
> >>
> >>        EINVAL The  API  version requested in the api field is not supported by
> >>               this kernel, or the features field was not zero.
> >>
> >>               ┌─────────────────────────────────────────────────────┐
> >>               │FIXME                                                │
> >>               ├─────────────────────────────────────────────────────┤
> >>               │In the above error case, the  returned  'uffdio_api' │
> >>               │structure  zeroed out. Why is this done? This should │
> >>               │be explained in the manual page.                     │
> >>               │                                                     │
> >>               └─────────────────────────────────────────────────────┘
> >  
> > In my understanding the uffdio_api structure is zeroed to allow the caller
> > to distinguish the reasons for -EINVAL.
> 
> Andrea, can you please help here?

It is zeroed out just for robustness; it's a slow path. If userland
mistakenly doesn't check for -EINVAL but does check uffdio_api.features,
uffdio_api.ioctls, or uffdio_api.api after the UFFDIO_API ioctl
returns, it will have a chance to catch the failure (at least it won't
risk parsing random uninitialized values).

I don't think it should be documented that uffdio_api is zeroed out, or
if it is documented, we should say userland shouldn't depend on it and
that it's done just for robustness.

The normal, correct way to catch an error is to check for -EINVAL; after
getting -EINVAL, the contents of uffdio_api should be ignored by
userland.

> >>    UFFDIO_REGISTER
> >>        (Since Linux 4.3.)  Register a memory  address  range  with  the  user‐
> >>        faultfd  object.   The  argp argument is a pointer to a uffdio_register
> >>        structure, defined as:
> >>
> >>            struct uffdio_range {
> >>                __u64 start;    /* Start of range */
> >>                __u64 len;      /* Length of range (bytes) */
> >>            };
> >>
> >>            struct uffdio_register {
> >>                struct uffdio_range range;
> >>                __u64 mode;     /* Desired mode of operation (input) */
> >>                __u64 ioctls;   /* Available ioctl() operations (output) */
> >>            };
> >>
> >>
> >>        The range field defines a memory range starting at start and continuing
> >>        for len bytes that should be handled by the userfaultfd.
> >>
> >>        The  mode  field  defines the mode of operation desired for this memory
> >>        region.  The following values may be bitwise  ORed  to  set  the  user‐
> >>        faultfd mode for the specified range:
> >>
> >>        UFFDIO_REGISTER_MODE_MISSING
> >>               Track page faults on missing pages.
> >>
> >>        UFFDIO_REGISTER_MODE_WP
> >>               Track page faults on write-protected pages.
> >>
> >>        Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
> >>
> >>        If the operation is successful, the kernel modifies the ioctls bit-mask
> >>        field to indicate which ioctl(2) operations are available for the spec‐
> >>        ified range.  This returned bit mask is as for UFFDIO_API.
> >>
> >>        This ioctl(2) operation returns 0 on success.  On error, -1 is returned
> >>        and errno is set to indicate the cause of the error.   Possible  errors
> >>        include:
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Is the following error list correct?                 │
> >>        │                                                     │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Here again it may be -EFAULT to indicate copy_{from,to}_user failure.
> > And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
> > mm_struct has gone by the time userfault grabs it. 
> 
> Okay -- added EFAULT. I think I'll skip ENOMEM for the moment, but
> will note the possibility in the page source.

There is also ENOMEM as a result of split_vma failing, and it isn't the
cleanest thing that it can mean real OOM, running out of vmas
(mm->map_count >= sysctl_max_map_count, not real OOM), or that the
process is exiting or there isn't a single vma in the mm.

If there's one vma but it isn't in the range we return -EINVAL, so we
could probably return -EINVAL if it's exiting or if there isn't a
single vma in the mm, and leave -ENOMEM for split_vma only.

In general -EINVAL is programmer error of some kind, -ENOMEM is
returned in memory related cases that trigger at runtime, however if
the process is exiting it's debatable if it's programmer error too
which is why I think we could return -EINVAL there.

I'd expect userland to treat -ENOMEM and -EINVAL about the same way.
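
As a rough illustration of treating the two errors alike, here is a sketch of a registration wrapper. Both function names are hypothetical, and the policy of giving up on the range for either error is an assumption of this example, not something the kernel requires.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: register [addr, addr + len) for missing-page
 * faults.  Returns 0 and stores the per-range ioctl mask in *ioctls on
 * success, or -errno on failure.
 */
static int uffd_register_missing(int uffd, void *addr, size_t len,
                                 uint64_t *ioctls)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uint64_t)(uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;

    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -errno;

    *ioctls = reg.ioctls;
    return 0;
}

/*
 * Assumed policy for this sketch: -EINVAL and -ENOMEM both mean the
 * range cannot be used, so the caller handles them the same way.
 */
static int uffd_register_unusable(int err)
{
    return err == -EINVAL || err == -ENOMEM;
}
```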

> >>        EINVAL There was no mapping in the specified address range.
> >>
> >>    UFFDIO_COPY
> >>        (Since  Linux 4.3.)  Atomically copy a continuous memory chunk into the
> >>        userfault registered range and optionally wake up the  blocked  thread.
> >>        The  source  and  destination addresses and the number of bytes to copy
> >>        are specified by the src, dst, and len fields of the uffdio_copy struc‐
> >>        ture pointed to by argp:
> >>
> >>            struct uffdio_copy {
> >>                __u64 dst;    /* Destination of copy */
> >>                __u64 src;    /* Source of copy */
> >>                __u64 len;    /* Number of bytes to copy */
> >>                __u64 mode;   /* Flags controlling behavior of copy */
> >>                __s64 copy;   /* Number of bytes copied, or negated error */
> >>            };
> >>
> >>        The  following value may be bitwise ORed in mode to change the behavior
> >>        of the UFFDIO_COPY operation:
> >>
> >>        UFFDIO_COPY_MODE_DONTWAKE
> >>               Do not wake up the thread that waits for page-fault resolution
> >>
> >>        The copy field is used by the kernel to return the number of bytes that
> >>        was actually copied, or an error (a negated errno-style value).
> >>
> >>
> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Above:  Why is the 'copy' field used to return error │
> >>        │values?  This should  be  explained  in  the  manual │
> >>        │page.                                                │
> >>        └─────────────────────────────────────────────────────┘
> > 
> > Andrea, can you help with this one, please?
> 
> Yes, Andrea, please.

Well, not just for error values. copy also tells how much was copied
if a signal made it return short with -EINTR or some other error
happened in the middle of the copy after we had already made some
progress.

Writing any other error into .copy (and not only writing positive
values there) is for robustness in case userland doesn't check the ioctl
retval, but nobody should depend on it. After the ioctl returns an
error userland should not check the uffdio_copy structure at all.

The thing to document is the number of bytes successfully copied in
uffdio_copy.copy (which may be a short copy and in turn must be
checked... unless one knows the len matches the arch PAGE_SIZE, but
it's still safer to check the uffdio_copy.copy field and be
consistent).

One more relevant retval for UFFDIO_COPY and UFFDIO_ZEROPAGE that I
noticed is missing from the current manpage is -EEXIST. Unlike
mmap/mremap, UFFDIO_COPY/ZEROPAGE will never tear down an existing
established mapping. This guarantees that even if the user has a race
condition in its code, there's no risk of silent memory corruption;
instead a meaningful error is returned by UFFDIO_COPY/ZEROPAGE. For
example, if two UFFDIO_COPY run concurrently on the same page, only the
first one will succeed and the second will return -EEXIST. Only the
first page will be copied, so no memory corruption can happen
(furthermore, the programmer can be notified of the race condition in
the code, which might even be intentional, but more likely is not).
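
A sketch of how a caller might cope with short copies and -EEXIST. The helper name uffd_copy_all is hypothetical. It assumes, as the ioctl_userfaultfd(2) page describes, that a short copy is reported as an EAGAIN error with the byte count left in the copy field; that is the one error case where this sketch reads the structure.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: copy len bytes from src into the registered
 * range at dst, resuming after short copies.  Returns 0 once the whole
 * range is populated, or -errno on failure.  -EEXIST means a page in
 * the range was already mapped and has been left untouched.
 */
static int uffd_copy_all(int uffd, uint64_t dst, uint64_t src, uint64_t len)
{
    while (len > 0) {
        struct uffdio_copy c;

        memset(&c, 0, sizeof(c));
        c.dst = dst;
        c.src = src;
        c.len = len;
        c.mode = 0;                 /* wake the faulting thread */

        if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
            return 0;               /* everything requested was copied */

        if (errno == EAGAIN && c.copy > 0) {
            /* Short copy: skip what was done and retry the rest. */
            dst += (uint64_t)c.copy;
            src += (uint64_t)c.copy;
            len -= (uint64_t)c.copy;
            continue;
        }

        return -errno;              /* includes -EEXIST */
    }
    return 0;
}
```

A handler that deliberately lets several threads race to resolve the same fault can treat -EEXIST as success; any other caller probably wants to log it as a sign of a race.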

> >>        ┌─────────────────────────────────────────────────────┐
> >>        │FIXME                                                │
> >>        ├─────────────────────────────────────────────────────┤
> >>        │Why  is  the  'zeropage'  field used to return error │
> >>        │values?  This should  be  explained  in  the  manual │
> >>        │page.                                                │
> >>        └─────────────────────────────────────────────────────┘
> 
> Help is still needed for this FIXME!

Same as uffdio_copy.copy: because we have to return the number of pages
that have been zeroed out in case we run into an error (signal -EINTR
or -EEXIST etc.) after we already succeeded partially on a couple of
pages. So we may as well write the error in the same field if no pages
were zeroed out. This way the program will misbehave more than if that
field were left untouched (and it could even be random in such a case, as
it can be left uninitialized by userland). Clearly it only makes a
difference if the programmer forgets to check the UFFDIO_ZEROPAGE
ioctl retval, and the retval must always be checked, so again, this is
only for robustness.

Awesome manpage! Super useful; it's fundamental to have a manpage,
especially when the syscall is not simple and straightforward in
functionality.

Thanks!
Andrea


end of thread, other threads:[~2017-05-03 21:46 UTC | newest]

Thread overview: 13+ messages
2017-03-20 20:08 Review request: draft userfaultfd(2) manual page Michael Kerrisk (man-pages)
2017-03-20 20:11 ` Review request: draft ioctl_userfaultfd(2) " Michael Kerrisk (man-pages)
2017-03-22 13:54   ` Mike Rapoport
2017-04-21  9:11     ` Michael Kerrisk (man-pages)
2017-04-21 11:07       ` Mike Rapoport
2017-04-21 11:41         ` Michael Kerrisk (man-pages)
2017-04-25  8:00           ` Mike Rapoport
2017-04-25 10:59             ` Michael Kerrisk (man-pages)
2017-05-03 21:46       ` Andrea Arcangeli
2017-03-21 14:01 ` Review request: draft userfaultfd(2) " Mike Rapoport
2017-04-21  6:30   ` Michael Kerrisk (man-pages)
2017-04-21 11:06     ` Mike Rapoport
2017-04-21 11:30       ` Michael Kerrisk (man-pages)
