* [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
From: Fam Zheng @ 2015-01-20  9:57 UTC
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

This adds a new system call, epoll_mod_wait. It is described below:

NAME
       epoll_mod_wait - modify and wait for I/O events on an epoll file
                        descriptor

SYNOPSIS

       int epoll_mod_wait(int epfd, int flags,
                          int ncmds, struct epoll_mod_cmd *cmds,
                          struct epoll_wait_spec *spec);

DESCRIPTION

       The epoll_mod_wait() system call can be seen as an enhanced combination
       of several epoll_ctl(2) calls followed by an epoll_pwait(2) call. It is
       superior in two cases:

       1) When epoll_ctl(2) calls are followed by epoll_wait(2), using
       epoll_mod_wait() saves transitions between user mode and kernel mode;

       2) When you need a wait timeout with finer than millisecond precision.

       The epoll_ctl(2) operations are embedded into this call through ncmds
       and cmds. The latter is an array of command structs:

           struct epoll_mod_cmd {

                  /* Reserved flags for future extension, must be 0 for now. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code once this
                   * command is executed by kernel */
                  int error;
           };
       
       There is no guarantee that the commands are executed in order. Events
       are polled only if all the commands execute successfully (all the error
       fields are set to 0).

       The last parameter "spec" is a pointer to struct epoll_wait_spec, which
       contains the information about how to poll the events. If it's NULL, this
       call will immediately return after running all the commands in cmds.

       The structure is defined as follows:

           struct epoll_wait_spec {

                  /* The same as "maxevents" in epoll_pwait() */
                  int maxevents;

                  /* The same as "events" in epoll_pwait() */
                  struct epoll_event *events;

                  /* Which clock to use for timeout */
                  int clockid;

                  /* Maximum time to wait if there is no event */
                  struct timespec timeout;

                  /* The same as "sigmask" in epoll_pwait() */
                  sigset_t *sigmask;

                  /* The same as "sigsetsize" in epoll_pwait() */
                  size_t sigsetsize;
           } EPOLL_PACKED;

RETURN VALUE

       When any error occurs, epoll_mod_wait() returns -1 and errno is set
       appropriately. The "error" field of a command is left unchanged until
       that command is executed; once it is executed, the field is set to the
       command's return code. See epoll_ctl(2) for details of the return
       codes.

       When successful, epoll_mod_wait() returns the number of file
       descriptors ready for the requested I/O, or zero if no file descriptor
       became ready before the requested timeout expired.

       If spec is NULL, it returns 0 if all the commands are successful, and -1
       if an error occurred.

ERRORS

       These errors apply to either the return value of epoll_mod_wait() or the
       error status of an individual command.

       EBADF  epfd or fd is not a valid file descriptor.

       EFAULT The memory area pointed to by events is not accessible with write
              permissions.

       EINTR  The call was interrupted by a signal handler before either any of
              the requested events occurred or the timeout expired; see
              signal(7).

       EINVAL epfd is not an epoll file descriptor, or maxevents is less than
              or equal to zero, or fd is the same as epfd, or the requested
              operation op is not supported by this interface.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
              already registered with this epoll instance.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
              with this epoll instance.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
              encountered while trying to register (EPOLL_CTL_ADD) a new file
              descriptor on an epoll instance.  See epoll(7) for further
              details.

       EPERM  The target file fd does not support epoll.

CONFORMING TO

       epoll_mod_wait() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
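
To make the semantics above concrete, here is a minimal userspace sketch that
approximates the proposed call with the existing epoll_ctl(2) and
epoll_pwait(2). It is an illustration under stated assumptions, not the kernel
implementation from this series. The names emulated_epoll_mod_wait() and
timespec_to_ms_ceil() are hypothetical, the structures are local copies of the
ones described above, "error" is assumed to hold a negative errno value, the
loop stops at the first failing command, and clockid is ignored because
epoll_pwait(2) offers no clock selection.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <time.h>

/* Local copies of the proposed structures; no released header defines them. */
struct epoll_mod_cmd {
	int flags;		/* must be 0 */
	int op;			/* EPOLL_CTL_ADD / MOD / DEL */
	int fd;
	uint32_t events;
	uint64_t data;
	int error;		/* output; assumed here to be 0 or -errno */
};

struct epoll_wait_spec {
	int maxevents;
	struct epoll_event *events;
	int clockid;		/* ignored by this emulation */
	struct timespec timeout;
	sigset_t *sigmask;
	size_t sigsetsize;
};

/* Round a non-negative timespec up to whole milliseconds, clamped to INT_MAX. */
static int timespec_to_ms_ceil(const struct timespec *ts)
{
	long long ms;

	assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0);
	if (ts->tv_sec >= INT_MAX / 1000)
		return INT_MAX;
	ms = (long long)ts->tv_sec * 1000 + (ts->tv_nsec + 999999) / 1000000;
	return (int)ms;
}

/* Hypothetical stand-in: run the commands, then wait if spec is non-NULL. */
static int emulated_epoll_mod_wait(int epfd, int flags, int ncmds,
				   struct epoll_mod_cmd *cmds,
				   struct epoll_wait_spec *spec)
{
	int i;

	(void)flags;	/* no top-level flags are defined in the description */
	assert(ncmds == 0 || cmds != NULL);

	for (i = 0; i < ncmds; i++) {
		struct epoll_event ev = { .events = cmds[i].events };

		ev.data.u64 = cmds[i].data;
		if (cmds[i].flags != 0) {
			cmds[i].error = -EINVAL;
			errno = EINVAL;
			return -1;
		}
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) < 0) {
			cmds[i].error = -errno;	/* errno is left for the caller */
			return -1;
		}
		cmds[i].error = 0;
	}

	if (!spec)
		return 0;	/* commands only, no wait */

	/* One extra syscall, and the timeout degrades to milliseconds. */
	return epoll_pwait(epfd, spec->events, spec->maxevents,
			   timespec_to_ms_ceil(&spec->timeout), spec->sigmask);
}
```

The emulation still costs one system call per command plus one for the wait,
and it has to round the timeout up to whole milliseconds. Those are the two
limitations the proposed system call is meant to remove.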

Fam Zheng (6):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Specify clockid explicitly
  epoll: Add definition for epoll_mod_wait structures
  epoll: Extract ep_ctl_do
  epoll: Add implementation for epoll_mod_wait
  x86: Hook up epoll_mod_wait syscall

 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/eventpoll.c                   | 219 +++++++++++++++++++++++++--------------
 include/linux/syscalls.h         |   5 +
 include/uapi/linux/eventpoll.h   |  20 ++++
 5 files changed, 167 insertions(+), 79 deletions(-)

-- 
1.9.3



* [PATCH RFC 1/6] epoll: Extract epoll_wait_do and epoll_pwait_do
From: Fam Zheng @ 2015-01-20  9:57 UTC
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

In preparation for epoll_mod_wait, this patch allows reusing the code from
the epoll_pwait implementation. The new functions use ktime_t for more accuracy.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 130 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 59 insertions(+), 71 deletions(-)
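
As a reading aid for the ep_poll() hunk below, the following userspace sketch
mirrors how a relative timeout is interpreted once it is carried as
nanoseconds: zero means do not block, a positive value becomes an absolute
deadline, and a negative value means block until an event arrives. The names
wait_mode, ms_to_ns(), classify_timeout() and deadline_ns() are hypothetical
and plain int64_t nanoseconds stand in for ktime_t. The saturating addition
only approximates the role timespec_add_safe() plays in the patch.

```c
#include <assert.h>
#include <stdint.h>

enum wait_mode { WAIT_NONBLOCK, WAIT_DEADLINE, WAIT_FOREVER };

/* Millisecond timeout, as passed to epoll_wait(2), converted to nanoseconds. */
static int64_t ms_to_ns(int ms)
{
	return (int64_t)ms * 1000000;
}

/* Same three-way split as the rewritten branch at the top of ep_poll(). */
static enum wait_mode classify_timeout(int64_t timeout_ns)
{
	if (timeout_ns == 0)
		return WAIT_NONBLOCK;	/* skip the wait queue entirely */
	if (timeout_ns > 0)
		return WAIT_DEADLINE;	/* sleep until now + timeout */
	return WAIT_FOREVER;		/* no expiry is armed */
}

/* Absolute deadline for a positive timeout, saturating instead of wrapping. */
static int64_t deadline_ns(int64_t now_ns, int64_t timeout_ns)
{
	assert(now_ns >= 0 && timeout_ns > 0);
	if (timeout_ns > INT64_MAX - now_ns)
		return INT64_MAX;
	return now_ns + timeout_ns;
}
```

The sign of the millisecond argument survives the conversion, so the existing
epoll_wait(2) convention of -1 for an infinite wait keeps working after the
switch to a nanosecond representation.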

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
 }
 
-static inline struct timespec ep_set_mstimeout(long ms)
-{
-	struct timespec now, ts = {
-		.tv_sec = ms / MSEC_PER_SEC,
-		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-	};
-
-	ktime_get_ts(&now);
-	return timespec_add_safe(now, ts);
-}
-
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  *           event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
  * @events: Pointer to the userspace buffer where the ready events should be
  *          stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- *           milliseconds. If the @timeout is zero, the function will not block,
- *           while if the @timeout is less than zero, the function will block
- *           until at least one event has been retrieved (or an error
- *           occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation.  If 0, the
+ *           function will not block. If negative, the function will block until
+ *           at least one event has been retrieved (or an error occurred).
  *
  * Returns: Returns the number of ready events which have been fetched, or an
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
-	if (timeout > 0) {
-		struct timespec end_time = ep_set_mstimeout(timeout);
-
-		slack = select_estimate_accuracy(&end_time);
-		to = &expires;
-		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
+	if (!ktime_to_ns(timeout)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
 		 * caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 		spin_lock_irqsave(&ep->lock, flags);
 		goto check_events;
+	} else if (ktime_to_ns(timeout) > 0) {
+		struct timespec now, end_time;
+
+		ktime_get_ts(&now);
+		end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+		slack = select_estimate_accuracy(&end_time);
+		to = &expires;
+		*to = timespec_to_ktime(end_time);
 	}
 
 fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
 	return error;
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+				int maxevents, const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -2002,29 +1988,32 @@ error_fput:
 
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_pwait(2).
+ * part of the user space epoll_wait(2).
  */
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
-		size_t, sigsetsize)
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+				 int maxevents, ktime_t timeout,
+				 sigset_t *sigmask, size_t sigsetsize)
 {
 	int error;
-	sigset_t ksigmask, sigsaved;
+	sigset_t sigsaved;
 
 	/*
 	 * If the caller wants a certain signal mask to be set during the wait,
 	 * we apply it here.
 	 */
 	if (sigmask) {
-		if (sigsetsize != sizeof(sigset_t))
-			return -EINVAL;
-		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
-			return -EFAULT;
 		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
+		set_current_blocked(sigmask);
 	}
 
-	error = sys_epoll_wait(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2044,49 +2033,48 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 	return error;
 }
 
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_pwait(2).
+ */
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+		size_t, sigsetsize)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	sigset_t ksigmask;
+
+	if (sigmask) {
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-			struct epoll_event __user *, events,
-			int, maxevents, int, timeout,
-			const compat_sigset_t __user *, sigmask,
-			compat_size_t, sigsetsize)
+		       struct epoll_event __user *, events,
+		       int, maxevents, int, timeout,
+		       const compat_sigset_t __user *, sigmask,
+		       compat_size_t, sigsetsize)
 {
-	long err;
 	compat_sigset_t csigmask;
-	sigset_t ksigmask, sigsaved;
+	sigset_t ksigmask;
+	ktime_t kt = ms_to_ktime(timeout);
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(compat_sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
 			return -EFAULT;
 		sigset_from_compat(&ksigmask, &csigmask);
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
-	}
-
-	err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (err == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
 	}
 
-	return err;
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 #endif
 
-- 
1.9.3



* [PATCH RFC 1/6] epoll: Extract epoll_wait_do and epoll_pwait_do
From: Fam Zheng @ 2015-01-20  9:57 UTC
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra

In preparation for epoll_mod_wait, this patch allows reusing the code from
the epoll_pwait implementation. The new functions use ktime_t for more accuracy.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 130 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 59 insertions(+), 71 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
 }
 
-static inline struct timespec ep_set_mstimeout(long ms)
-{
-	struct timespec now, ts = {
-		.tv_sec = ms / MSEC_PER_SEC,
-		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-	};
-
-	ktime_get_ts(&now);
-	return timespec_add_safe(now, ts);
-}
-
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  *           event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
  * @events: Pointer to the userspace buffer where the ready events should be
  *          stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- *           milliseconds. If the @timeout is zero, the function will not block,
- *           while if the @timeout is less than zero, the function will block
- *           until at least one event has been retrieved (or an error
- *           occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation.  If 0, the
+ *           function will not block. If negative, the function will block until
+ *           at least one event has been retrieved (or an error occurred).
  *
  * Returns: Returns the number of ready events which have been fetched, or an
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
-	if (timeout > 0) {
-		struct timespec end_time = ep_set_mstimeout(timeout);
-
-		slack = select_estimate_accuracy(&end_time);
-		to = &expires;
-		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
+	if (!ktime_to_ns(timeout)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
 		 * caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 		spin_lock_irqsave(&ep->lock, flags);
 		goto check_events;
+	} else if (ktime_to_ns(timeout) > 0) {
+		struct timespec now, end_time;
+
+		ktime_get_ts(&now);
+		end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+		slack = select_estimate_accuracy(&end_time);
+		to = &expires;
+		*to = timespec_to_ktime(end_time);
 	}
 
 fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
 	return error;
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+				int maxevents, const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -2002,29 +1988,32 @@ error_fput:
 
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_pwait(2).
+ * part of the user space epoll_wait(2).
  */
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
-		size_t, sigsetsize)
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+				 int maxevents, ktime_t timeout,
+				 sigset_t *sigmask, size_t sigsetsize)
 {
 	int error;
-	sigset_t ksigmask, sigsaved;
+	sigset_t sigsaved;
 
 	/*
 	 * If the caller wants a certain signal mask to be set during the wait,
 	 * we apply it here.
 	 */
 	if (sigmask) {
-		if (sigsetsize != sizeof(sigset_t))
-			return -EINVAL;
-		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
-			return -EFAULT;
 		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
+		set_current_blocked(sigmask);
 	}
 
-	error = sys_epoll_wait(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2044,49 +2033,48 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 	return error;
 }
 
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_pwait(2).
+ */
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+		size_t, sigsetsize)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	sigset_t ksigmask;
+
+	if (sigmask) {
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-			struct epoll_event __user *, events,
-			int, maxevents, int, timeout,
-			const compat_sigset_t __user *, sigmask,
-			compat_size_t, sigsetsize)
+		       struct epoll_event __user *, events,
+		       int, maxevents, int, timeout,
+		       const compat_sigset_t __user *, sigmask,
+		       compat_size_t, sigsetsize)
 {
-	long err;
 	compat_sigset_t csigmask;
-	sigset_t ksigmask, sigsaved;
+	sigset_t ksigmask;
+	ktime_t kt = ms_to_ktime(timeout);
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(compat_sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
 			return -EFAULT;
 		sigset_from_compat(&ksigmask, &csigmask);
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
-	}
-
-	err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (err == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
 	}
 
-	return err;
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 #endif
 
-- 
1.9.3
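
Reader's note: after this patch, ep_poll() picks one of three paths from the sign of the ktime_t timeout. The following userspace sketch models only that branch structure. The names wait_mode, ms_to_ns() and classify_timeout() are hypothetical and do not exist in the kernel; ms_to_ns() merely stands in for ms_to_ktime() followed by ktime_to_ns().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: this only models the branches taken in ep_poll(). */
enum wait_mode {
	WAIT_NONBLOCK,	/* timeout == 0: check the ready list once, never sleep */
	WAIT_TIMED,	/* timeout > 0: sleep until an absolute deadline */
	WAIT_FOREVER	/* timeout < 0: sleep until an event or a signal */
};

/* Stand-in for ms_to_ktime() followed by ktime_to_ns(). */
static int64_t ms_to_ns(int ms)
{
	return (int64_t)ms * 1000000;
}

static enum wait_mode classify_timeout(int64_t ns)
{
	if (ns == 0)
		return WAIT_NONBLOCK;
	if (ns > 0)
		return WAIT_TIMED;
	return WAIT_FOREVER;
}

/* Usage example: the three cases an epoll_wait(2) caller can hit. */
static void timeout_model_example(void)
{
	assert(classify_timeout(ms_to_ns(0)) == WAIT_NONBLOCK);
	assert(classify_timeout(ms_to_ns(250)) == WAIT_TIMED);
	assert(classify_timeout(ms_to_ns(-1)) == WAIT_FOREVER);
}
```

Keeping the value in nanoseconds is what would let a later caller pass sub-millisecond timeouts through the same path.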


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 2/6] epoll: Specify clockid explicitly
  2015-01-20  9:57 ` Fam Zheng
  (?)
@ 2015-01-20  9:57   ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

A later patch adds clockid to the interface, so start passing an explicit
clockid internally. For now all callers pass CLOCK_MONOTONIC, so behavior is
unchanged. This also drops the unused sigsetsize parameter of epoll_pwait_do.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..6da143f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, const ktime_t timeout)
+		   int maxevents, int clockid, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1624,7 +1624,8 @@ fetch_events:
 			}
 
 			spin_unlock_irqrestore(&ep->lock, flags);
-			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+			if (!schedule_hrtimeout_range_clock(to, slack,
+						HRTIMER_MODE_ABS, clockid))
 				timed_out = 1;
 
 			spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1946,8 @@ error_return:
 }
 
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
-				int maxevents, const ktime_t timeout)
+				int maxevents, int clockid, 
+				const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -1979,7 +1981,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 	ep = f.file->private_data;
 
 	/* Time to fish for events ... */
-	error = ep_poll(ep, events, maxevents, timeout);
+	error = ep_poll(ep, events, maxevents, clockid, timeout);
 
 error_fput:
 	fdput(f);
@@ -1994,12 +1996,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 		int, maxevents, int, timeout)
 {
 	ktime_t kt = ms_to_ktime(timeout);
-	return epoll_wait_do(epfd, events, maxevents, kt);
+	return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
 }
 
 static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
-				 int maxevents, ktime_t timeout,
-				 sigset_t *sigmask, size_t sigsetsize)
+				 int maxevents,
+				 int clockid, ktime_t timeout,
+				 sigset_t *sigmask)
 {
 	int error;
 	sigset_t sigsaved;
@@ -2013,7 +2016,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
 		set_current_blocked(sigmask);
 	}
 
-	error = epoll_wait_do(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2053,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
 			return -EFAULT;
 	}
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 
 #ifdef CONFIG_COMPAT
@@ -2073,8 +2076,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		sigset_from_compat(&ksigmask, &csigmask);
 	}
 
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 #endif
 
-- 
1.9.3
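
Reader's note: the absolute deadline that ep_poll() computes (current time plus relative timeout) only makes sense relative to one particular clock, which is why the clockid has to travel with it. Below is a rough userspace sketch of the same computation using clock_gettime(2). The helpers timespec_add_rel() and deadline_on_clock() are hypothetical names. The sketch does a plain normalized addition; to my knowledge the kernel's timespec_add_safe() additionally clamps on overflow, which is not modelled here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define DEMO_NSEC_PER_SEC 1000000000L

/* Add a relative timeout to a base time, keeping tv_nsec in [0, 1e9). */
static struct timespec timespec_add_rel(struct timespec base,
					struct timespec rel)
{
	struct timespec sum;

	assert(base.tv_nsec >= 0 && base.tv_nsec < DEMO_NSEC_PER_SEC);
	assert(rel.tv_nsec >= 0 && rel.tv_nsec < DEMO_NSEC_PER_SEC);

	sum.tv_sec = base.tv_sec + rel.tv_sec;
	sum.tv_nsec = base.tv_nsec + rel.tv_nsec;
	if (sum.tv_nsec >= DEMO_NSEC_PER_SEC) {
		sum.tv_sec++;
		sum.tv_nsec -= DEMO_NSEC_PER_SEC;
	}
	return sum;
}

/* Returns 0 and fills *deadline, or -1 with errno set by clock_gettime(). */
static int deadline_on_clock(clockid_t clockid, struct timespec rel,
			     struct timespec *deadline)
{
	struct timespec now;

	if (clock_gettime(clockid, &now) != 0)
		return -1;
	*deadline = timespec_add_rel(now, rel);
	return 0;
}
```

A deadline computed against CLOCK_MONOTONIC is not comparable with one computed against CLOCK_REALTIME, so the sleep primitive has to be told which clock the absolute value refers to.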


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 3/6] epoll: Add definition for epoll_mod_wait structures
  2015-01-20  9:57 ` Fam Zheng
  (?)
@ 2015-01-20  9:57   ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

Define the two structs used by the upcoming syscall. The flags field in
epoll_mod_cmd is reserved; it improves word alignment and leaves room for
future extension.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/uapi/linux/eventpoll.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..e32a804 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
 #include <linux/fcntl.h>
 #include <linux/types.h>
 
+#include <linux/signal.h>
+
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
@@ -61,6 +63,24 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+struct epoll_mod_cmd {
+	int flags;
+	int op;
+	int fd;
+	__u32 events;
+	__u64 data;
+	int error;
+} EPOLL_PACKED;
+
+struct epoll_wait_spec {
+	int maxevents;
+	struct epoll_event *events;
+	int clockid;
+	struct timespec timeout;
+	sigset_t *sigmask;
+	size_t sigsetsize;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3
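
Reader's note: to make the alignment remark concrete, here is a userspace sketch that mirrors the proposed struct epoll_mod_cmd and checks where the 64-bit data member lands. The type demo_epoll_mod_cmd and the helper demo_cmd() are hypothetical local names, not uapi definitions. The sketch always applies the packed attribute, whereas EPOLL_PACKED is only expected to expand to it on x86_64, and initializing error to 0 is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>

/* Local mirror of the proposed struct epoll_mod_cmd; not a real uapi type. */
struct demo_epoll_mod_cmd {
	int flags;		/* reserved, must be 0 */
	int op;			/* EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL */
	int fd;
	uint32_t events;
	uint64_t data;
	int error;		/* per-command result slot */
} __attribute__((packed));

/* With the reserved flags word in front, data starts on an 8-byte boundary. */
_Static_assert(offsetof(struct demo_epoll_mod_cmd, data) == 16,
	       "data is expected at offset 16");

/* Build one command with the reserved and result fields zeroed. */
static struct demo_epoll_mod_cmd demo_cmd(int op, int fd, uint32_t events,
					  uint64_t data)
{
	struct demo_epoll_mod_cmd cmd;

	assert(op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD ||
	       op == EPOLL_CTL_DEL);

	memset(&cmd, 0, sizeof(cmd));
	cmd.op = op;
	cmd.fd = fd;
	cmd.events = events;
	cmd.data = data;
	return cmd;
}
```

Without the flags member, the three remaining 32-bit fields would put data at offset 12 in the packed layout, which is the misalignment the reserved word avoids.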


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 4/6] epoll: Extract ep_ctl_do
  2015-01-20  9:57 ` Fam Zheng
  (?)
@ 2015-01-20  9:57   ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

Extract the common part of the epoll_ctl implementation into ep_ctl_do() so
that it can be shared with the upcoming epoll_mod_wait.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6da143f..e7a116d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1808,22 +1808,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
 {
 	int error;
 	int full_check = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
 	struct eventpoll *tep = NULL;
 
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
 	error = -EBADF;
 	f = fdget(epfd);
 	if (!f.file)
@@ -1945,6 +1938,23 @@ error_return:
 	return error;
 }
 
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	struct epoll_event epds;
+
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	return ep_ctl_do(epfd, op, fd, epds);
+}
+
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 				int maxevents, int clockid, 
 				const ktime_t timeout)
-- 
1.9.3
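
Reader's note: once the user-copy step is split from the worker, a caller can run the worker over an array of commands. The userspace sketch below approximates that with one epoll_ctl(2) call per command, which is exactly the per-call overhead the series wants to avoid. The type demo_ctl_cmd and the helper demo_apply_cmds() are hypothetical. Storing a negative errno in error and continuing past a failed command are my assumptions, not behavior taken from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical userspace command record, loosely following epoll_mod_cmd. */
struct demo_ctl_cmd {
	int op;
	int fd;
	uint32_t events;
	uint64_t data;
	int error;	/* 0 on success, negative errno on failure (assumed) */
};

/* Apply each command with epoll_ctl(2); return how many commands failed. */
static int demo_apply_cmds(int epfd, struct demo_ctl_cmd *cmds, int ncmds)
{
	int i, failed = 0;

	assert(ncmds >= 0);

	for (i = 0; i < ncmds; i++) {
		struct epoll_event ev = { .events = cmds[i].events };

		ev.data.u64 = cmds[i].data;
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) == 0) {
			cmds[i].error = 0;
		} else {
			cmds[i].error = -errno;
			failed++;
		}
	}
	return failed;
}
```

For example, queueing EPOLL_CTL_ADD twice for the same descriptor would leave the first command with error 0 and the second with -EEXIST.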


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 4/6] epoll: Extract ep_ctl_do
@ 2015-01-20  9:57   ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra

This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_mod_wait.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6da143f..e7a116d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1808,22 +1808,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
 {
 	int error;
 	int full_check = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
 	struct eventpoll *tep = NULL;
 
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
 	error = -EBADF;
 	f = fdget(epfd);
 	if (!f.file)
@@ -1945,6 +1938,23 @@ error_return:
 	return error;
 }
 
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	struct epoll_event epds;
+
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	return ep_ctl_do(epfd, op, fd, epds);
+}
+
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 				int maxevents, int clockid, 
 				const ktime_t timeout)
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-20  9:57 ` Fam Zheng
  (?)
@ 2015-01-20  9:57   ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

This syscall performs, in sequence:

1) a number of epoll_ctl calls
2) an epoll_pwait, with an enhanced timeout.

The epoll_ctl operations are embedded so that an application doesn't have to
use separate syscalls to insert/delete/update the fds before polling. This is
more efficient if the set of fds varies from one poll to another, which is a
common pattern for certain applications. For example, depending on the input
buffer status, a data reading program may decide to temporarily stop polling
an fd.

Because this interface enables batching, even a regular epoll_ctl call
sequence that manipulates several fds can be reduced to a single
epoll_mod_wait call (with spec=NULL to skip the poll part).
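
For comparison, here is a rough sketch of the existing userspace pattern that
one batched call is meant to replace. It uses only the current epoll_ctl(2)
and epoll_pwait(2) interfaces; "struct mod_cmd" and "mod_then_wait" are
hypothetical names made up for this illustration and are not part of the
patch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical userspace description of one command; not the kernel ABI. */
struct mod_cmd {
	int op;				/* EPOLL_CTL_ADD, _MOD or _DEL */
	int fd;				/* target file descriptor */
	struct epoll_event event;	/* ignored for EPOLL_CTL_DEL */
};

/*
 * Apply the commands one by one, stop at the first failure, then wait.
 * Returns the epoll_pwait() result, or -1 with errno set by the failing
 * call. Note the cost: ncmds + 1 transitions into the kernel.
 */
int mod_then_wait(int epfd, struct mod_cmd *cmds, int ncmds,
		  struct epoll_event *events, int maxevents,
		  int timeout_ms, const sigset_t *sigmask)
{
	int i;

	for (i = 0; i < ncmds; i++) {
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd,
			      &cmds[i].event) < 0)
			return -1;
	}
	return epoll_pwait(epfd, events, maxevents, timeout_ms, sigmask);
}
```

With N commands this helper enters the kernel N + 1 times, which is the
overhead the batched interface tries to avoid.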

The only complexity is returning the result of each operation.  For each
epoll_mod_cmd in cmds, the field "error" is an output field that stores the
return code *iff* the command is executed (0 for success, or the -errno of
the equivalent epoll_ctl call). It is left unchanged if the command is not
executed because of some earlier error, for example a failure of
copy_from_user to copy the array.

Applications can use this fact for error handling: they can initialize every
epoll_mod_cmd.error to a positive value, which is by definition not a
possible output value of epoll_mod_wait. When the syscall returns, they can
tell whether each command was executed by comparing its error field with the
initial value; if the two differ, the field holds the result of the command.
More roughly, they can store any non-zero value and not distinguish "not run"
from failure.
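
The sentinel convention can be sketched in plain userspace C. Everything here
is an assumption for illustration only: "struct cmd" is a cut-down stand-in
for epoll_mod_cmd, and "fake_mod_wait" merely imitates the stop-at-first-error
behaviour described above (a negative fd "fails" with -EBADF); it is not the
syscall.

```c
#include <errno.h>
#include <stddef.h>

#define CMD_NOT_RUN 1	/* positive sentinel: never written by the call */

/* Hypothetical, simplified command with only the fields the sketch needs. */
struct cmd {
	int op;
	int fd;
	int error;	/* output: 0 or -errno if executed, else untouched */
};

/*
 * Stand-in for the batched call (assumption, not the real syscall).
 * Execution stops at the first failing command; the error fields of
 * later commands are left untouched.
 */
int fake_mod_wait(struct cmd *cmds, int ncmds)
{
	int i, ret = 0;

	for (i = 0; i < ncmds; i++) {
		cmds[i].error = ret = (cmds[i].fd < 0) ? -EBADF : 0;
		if (ret)
			break;
	}
	return ret;
}

/* Arm every command with the sentinel before submitting the batch. */
void cmds_arm(struct cmd *cmds, int ncmds)
{
	int i;

	for (i = 0; i < ncmds; i++)
		cmds[i].error = CMD_NOT_RUN;
}

/* Count the commands that were executed, successfully or not. */
int cmds_executed(const struct cmd *cmds, int ncmds)
{
	int i, n = 0;

	for (i = 0; i < ncmds; i++)
		if (cmds[i].error != CMD_NOT_RUN)
			n++;
	return n;
}
```

For a batch of three commands whose second one fails, the first error field
ends up 0, the second holds -EBADF, and the third still holds CMD_NOT_RUN, so
the caller can tell "failed" apart from "never attempted".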

Also, the timeout parameter is enhanced: a timespec is used instead of the
old ms scalar, which provides higher precision. The "clockid" field in struct
epoll_wait_spec also lets users pick a clock other than the default when that
makes more sense.
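
To make the precision difference concrete, here is a small sketch of
converting between the two timeout representations. The helper names are made
up for this example, and non-negative inputs are assumed (the special -1
"block forever" value of epoll_wait(2) is not handled).

```c
#include <time.h>

/* Convert a non-negative millisecond timeout to a struct timespec. */
struct timespec ms_to_timespec(long ms)
{
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	return ts;
}

/*
 * Convert back, rounding down. Any sub-millisecond part is lost, which is
 * exactly what the old int-millisecond interface cannot express.
 */
long timespec_to_ms_floor(struct timespec ts)
{
	return (long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000L;
}
```

A 250 microsecond timeout, { .tv_sec = 0, .tv_nsec = 250000 }, rounds down to
0 ms, so a millisecond interface can only poll without blocking or wait at
least four times longer than requested.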

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h |  5 ++++
 2 files changed, 65 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e7a116d..2cc22c9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL);
 }
 
+SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
+		int, ncmds, struct epoll_mod_cmd __user *, cmds,
+		struct epoll_wait_spec __user *, spec)
+{
+	struct epoll_mod_cmd *kcmds = NULL;
+	int i, ret = 0;
+	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
+
+	if (flags)
+		return -EINVAL;
+	if (ncmds) {
+		if (!cmds)
+			return -EINVAL;
+		kcmds = kmalloc(cmd_size, GFP_KERNEL);
+		if (!kcmds)
+			return -ENOMEM;
+		if (copy_from_user(kcmds, cmds, cmd_size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+	for (i = 0; i < ncmds; i++) {
+		struct epoll_event ev = (struct epoll_event) {
+			.events = kcmds[i].events,
+			.data = kcmds[i].data,
+		};
+		if (kcmds[i].flags) {
+			kcmds[i].error = ret = -EINVAL;
+			goto out;
+		}
+		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
+		if (ret)
+			goto out;
+	}
+	if (spec) {
+		sigset_t ksigmask;
+		struct epoll_wait_spec kspec;
+		ktime_t timeout;
+
+		if (copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
+			return -EFAULT;
+		if (kspec.sigmask) {
+			if (kspec.sigsetsize != sizeof(sigset_t))
+				return -EINVAL;
+			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
+				return -EFAULT;
+		}
+		timeout = timespec_to_ktime(kspec.timeout);
+		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
+				     kspec.clockid, timeout,
+				     kspec.sigmask ? &ksigmask : NULL);
+	}
+
+out:
+	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
+		return -EFAULT;
+	kfree(kcmds);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..7156c80 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,8 @@
 #define _LINUX_SYSCALLS_H
 
 struct epoll_event;
+struct epoll_mod_cmd;
+struct epoll_wait_spec;
 struct iattr;
 struct inode;
 struct iocb;
@@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
+				   int ncmds, struct epoll_mod_cmd __user * cmds,
+				   struct epoll_wait_spec __user * spec);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH RFC 6/6] x86: Hook up epoll_mod_wait syscall
  2015-01-20  9:57 ` Fam Zheng
  (?)
@ 2015-01-20  9:57   ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..52aead3 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	epoll_mod_wait		sys_epoll_mod_wait
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..c3c203a 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	epoll_mod_wait		sys_epoll_mod_wait
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-20 10:37   ` Rasmus Villemoes
  0 siblings, 0 replies; 80+ messages in thread
From: Rasmus Villemoes @ 2015-01-20 10:37 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	linux-fsdevel, linux-api, Josh Triplett,
	Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, Jan 20 2015, Fam Zheng <famz@redhat.com> wrote:

> DESCRIPTION
>
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>        
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>        
>        2) When you need higher precision than microsecond for wait timeout.

You probably want to say millisecond.

>            struct epoll_mod_cmd {
[...]
>            };


>            struct epoll_wait_spec {
[...]
>            } EPOLL_PACKED;

Either both or none of these should mention that EPOLL_PACKED is in fact
part of the actual definition. The changelog for 3/6 sorta mentions
that it's not really needed for epoll_mod_cmd. Why is it necessary for
either struct?
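
As background for the question, a minimal sketch of what a packed attribute
changes in a struct layout. It assumes GCC or Clang (for
__attribute__((packed))); the struct names are invented for the example, and
the members only mimic the shape of struct epoll_event (a 32-bit mask
followed by 64-bit data).

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler may pad before the 64-bit member. */
struct ev_natural {
	uint32_t events;
	uint64_t data;
};

/* Packed layout: no padding, so "data" starts right after "events". */
struct ev_packed {
	uint32_t events;
	uint64_t data;
} __attribute__((packed));

size_t natural_size(void)
{
	return sizeof(struct ev_natural);
}

size_t packed_size(void)
{
	return sizeof(struct ev_packed);
}

size_t packed_data_offset(void)
{
	return offsetof(struct ev_packed, data);
}
```

The packed variant is always 12 bytes with "data" at offset 4. The natural
variant depends on the ABI: typically 16 bytes on x86-64, where a 64-bit
integer is 8-byte aligned, and 12 bytes on i386, so packing only matters
where the two layouts would otherwise differ.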

> RETURN VALUE
>
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.

And here, it doesn't make sense to mention a unit, since the new timeout
is given using struct timespec (this was the whole point, right?).

Rasmus

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-20 10:53     ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-20 10:53 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	linux-fsdevel, linux-api, Josh Triplett,
	Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, 01/20 11:37, Rasmus Villemoes wrote:
> On Tue, Jan 20 2015, Fam Zheng <famz@redhat.com> wrote:
> 
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >        
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >        
> >        2) When you need higher precision than microsecond for wait timeout.
> 
> You probably want to say millisecond.

Yes, you see that I just can't make this right. :)

> 
> >            struct epoll_mod_cmd {
> [...]
> >            };
> 
> 
> >            struct epoll_wait_spec {
> [...]
> >            } EPOLL_PACKED;
> 
> Either both or none of these should mention that EPOLL_PACKED is in fact
> part of the actual definition. The changelog for 3/6 sorta mentions
> that it's not really needed for epoll_mod_cmd. Why is it necessary for
> either struct?

Yeah, it's probably not really necessary.

> 
> > RETURN VALUE
> >
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> 
> And here, it doesn't make sense to mention a unit, since the new timeout
> is given using struct timespec (this was the whole point, right?).

Right!

Thanks,
Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-20 12:48   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 80+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:48 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel
  Cc: mtk.manpages, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini

Hello Fam Zheng,

I know this API has been through a number of iterations, and there were 
discussions about the design that led to it becoming more complex.
But, let us assume that someone has not seen those discussions,
or forgotten them, or is too lazy to go hunting list archives.

Then: this patch series should somewhere have an explanation of
why the API is what it is, ideally with links to previous relevant
discussions. I see that you do part of that in

    [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait

There are however no links to previous discussions in that mail (I guess
http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591 is most
relevant), nor is there any sort of change log in the commit message
that explains the evolution of the API. Having those would ease the 
task of reviewers.

Coming back to THIS mail, this man page should also include an
explanation of why the API is what it is. That would include much
of the detail from the 5/6 patch, and probably more info besides.

Some specific points below.

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
> 
> NAME
>        epoll_mod_wait - modify and wait for I/O events on an epoll file
>                         descriptor
> 
> SYNOPSIS
> 
>        int epoll_mod_wait(int epfd, int flags,
>                           int ncmds, struct epoll_mod_cmd *cmds,
>                           struct epoll_wait_spec *spec);
> 
> DESCRIPTION
> 
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>        
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>
>        2) When you need higher precision than microsecond for wait timeout.

s/microsecond/millisecond/

> 
>        The epoll_ctl(2) operations are embedded into this call by with ncmds
>        and cmds. The latter is an array of command structs:
> 
>            struct epoll_mod_cmd {
> 
>                   /* Reserved flags for future extension, must be 0 for now. */
>                   int flags;
> 
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
> 
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
> 
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
> 
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
> 
>                   /* Output field, will be set to the return code once this
>                    * command is executed by kernel */
>                   int error;
>            };
>        
>        There is no guartantee that all the commands are executed in order. Only

s/guartantee/guarantee/

I think the word "all" is not needed in this sentence.

Why is there no guarantee that the commands are run in order?
The order matters if there are operations on the same fd.

>        if all the commands are successfully executed (all the error fields are
>        set to 0), events are polled.

Does the operation execute all commands, or stop when it encounters the first 
error? In other words, when looping over the returned 'error' fields, what
is the termination condition for the user-space application?

(Yes, I know I can trivially inspect the patch 5/6 to answer this question, 
but the man page should explicitly state this so that I don't have to 
read the source, and also because it is only if you explicitly document 
the intended behavior that I can tell whether the actual implementation 
matches the intention.)

>        The last parameter "spec" is a pointer to struct epoll_wait_spec, which
>        contains the information about how to poll the events. If it's NULL, this
>        call will immediately return after running all the commands in cmds.
> 
>        The structure is defined as below:
> 
>            struct epoll_wait_spec {
> 
>                   /* The same as "maxevents" in epoll_pwait() */
>                   int maxevents;
> 
>                   /* The same as "events" in epoll_pwait() */
>                   struct epoll_event *events;
> 
>                   /* Which clock to use for timeout */
>                   int clockid;

Which clocks can be specified here?
CLOCK_MONOTONIC?
CLOCK_REALTIME?
CLOCK_PROCESS_CPUTIME_ID?
clock_getcpuclockid()?
others?

>                   /* Maximum time to wait if there is no event */
>                   struct timespec timeout;

Is this timeout relative or absolute?

>                   /* The same as "sigmask" in epoll_pwait() */
>                   sigset_t *sigmask;

I just want to confirm here that 'sigmask' can be NULL, meaning
that we degenerate to epoll_wait() functionality, right?

>                   /* The same as "sigsetsize" in epoll_pwait() */
>                   size_t sigsetsize;
>            } EPOLL_PACKED;

What is the "EPOLL_PACKED" here for?
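
For reference, a minimal sketch of what packing changes in this layout. It
assumes EPOLL_PACKED expands to GCC's __attribute__((packed)), which as far
as I know is how the existing eventpoll UAPI header defines it on x86-64.
The struct names and wait_spec_padding() are made up for illustration and
are not part of the patch.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <time.h>

/* Members as listed in the RFC, laid out with natural alignment. */
struct wait_spec_natural {
	int maxevents;
	struct epoll_event *events;
	int clockid;
	struct timespec timeout;
	sigset_t *sigmask;
	size_t sigsetsize;
};

/* Same members; assumes EPOLL_PACKED means __attribute__((packed)). */
struct wait_spec_packed {
	int maxevents;
	struct epoll_event *events;
	int clockid;
	struct timespec timeout;
	sigset_t *sigmask;
	size_t sigsetsize;
} __attribute__((packed));

/* Number of padding bytes that packing removes from the structure. */
size_t wait_spec_padding(void)
{
	return sizeof(struct wait_spec_natural) -
	       sizeof(struct wait_spec_packed);
}
```

On a typical LP64 target the two int members are each followed by 4 bytes
of padding in the natural layout, so packing would save 8 bytes at the cost
of misaligned pointer and timespec members.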

> RETURN VALUE
> 
>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>        appropriately. All the "error" fields in cmds are unchanged before they
>        are executed, and if any cmds are executed, the "error" fields are set
>        to a return code accordingly. See also epoll_ctl for more details of the
>        return code.
> 
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.

s/milliseconds//

> 
>        If spec is NULL, it returns 0 if all the commands are successful, and -1
>        if an error occured.

s/occured/occurred/

> ERRORS
> 
>        These errors apply on either the return value of epoll_mod_wait or error
>        status for each command, respectively.
>
>        EBADF  epfd or fd is not a valid file descriptor.
> 
>        EFAULT The memory area pointed to by events is not accessible with write
>               permissions.
> 
>        EINTR  The call was interrupted by a signal handler before either any of
>               the requested events occurred or the timeout expired; see
>               signal(7).
> 
>        EINVAL epfd is not an epoll file descriptor, or maxevents is less than
>               or equal to zero, or fd is the same as epfd, or the requested
>               operation op is not supported by this interface.

Add: Or 'flags' is nonzero. Or a 'cmds.flags' field is nonzero.

>        EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
>               already registered with this epoll instance.
> 
>        ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
>               with this epoll instance.
> 
>        ENOMEM There was insufficient memory to handle the requested op control
>               operation.
> 
>        ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
>               encountered while trying to register (EPOLL_CTL_ADD) a new file
>               descriptor on an epoll instance.  See epoll(7) for further
>               details.
> 
>        EPERM  The target file fd does not support epoll.
> 
> CONFORMING TO
> 
>        epoll_mod_wait() is Linux-specific.
> 
> SEE ALSO
> 
>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Please add sigprocmask(2).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-20  9:57   ` Fam Zheng
  (?)
@ 2015-01-20 12:50     ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 80+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-20 12:50 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel
  Cc: mtk.manpages, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini

Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. 

Which applications? Could we have some specific examples? This is a 
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^ should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single call
    means that even when no wait events are requested (i.e., spec == NULL),
    epoll_mod_wait() provides a performance optimization over using multiple
    epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The fact that the 'cmds' are not executed in a specified order, plus the
need to initialize the 'error' fields to a positive value, feels a bit ugly.
And indeed the whole "command list was only partially run" case
is not pretty. Am I correct to understand that if an error is found
during execution of one of the "epoll_ctl" commands in 'cmds' then
the system call will return -1 with errno set, indicating an error,
even though the epoll interest list may have changed because some
of the earlier 'cmds' executed successfully? This all seems a bit of
a headache for user space.
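
To make the burden on user space concrete, here is a minimal sketch of the
sentinel scheme described in the quoted text. The struct layout is copied
from the 0/6 man page; CMD_ERROR_UNSET, cmds_mark_unset() and cmd_classify()
are hypothetical names of mine, and the choice of 1 as the sentinel is just
one possible positive value.

```c
#include <stdint.h>

/* Command layout as proposed in the 0/6 man page. */
struct epoll_mod_cmd {
	int flags;
	int op;
	int fd;
	uint32_t events;
	uint64_t data;
	int error;
};

/* Hypothetical sentinel: positive, so never written by the kernel. */
#define CMD_ERROR_UNSET 1

enum cmd_state { CMD_NOT_RUN, CMD_OK, CMD_FAILED };

/* Must be called before every epoll_mod_wait() invocation. */
void cmds_mark_unset(struct epoll_mod_cmd *cmds, int ncmds)
{
	for (int i = 0; i < ncmds; i++)
		cmds[i].error = CMD_ERROR_UNSET;
}

/* After the call: sentinel = not run, 0 = success, -errno = failure. */
enum cmd_state cmd_classify(const struct epoll_mod_cmd *cmd)
{
	if (cmd->error == CMD_ERROR_UNSET)
		return CMD_NOT_RUN;
	return cmd->error == 0 ? CMD_OK : CMD_FAILED;
}
```

Without an ordering guarantee, the application has to classify every
element this way after each failed call; it cannot stop at the first
failure.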

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful,
since it results in fewer system calls. But, does it really
need to be bound together with the "epoll_pwait" functionality?
(Perhaps this point was covered in previous discussions, but
neither the message accompanying this patch nor the 0/6 man page
provide a compelling rationale for the need to bind these two
operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the 
need for double pointer indirection in the 'spec' argument (i.e., 
spec is a pointer to an array of structures where each element
in turn includes an 'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

epoll_ctl_batch(int epfd, int flags,
                int ncmds, struct epoll_mod_cmd *cmds);

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, 
             const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but 
with greater simplicity. And epoll_ctl_batch() could simply return
the number of 'cmds' that were successfully executed.
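
As a sketch of the semantics I have in mind for epoll_ctl_batch(), here
is a user-space stand-in built on plain epoll_ctl(2). It assumes the
commands are applied in array order and that processing stops at the
first failure; epoll_ctl_batch_emul() is a made-up name, and a real
syscall would of course do this work in the kernel.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Layout as given in the 0/6 man page draft (not an existing header). */
struct epoll_mod_cmd {
	int flags;
	int op;
	int fd;
	uint32_t events;
	uint64_t data;
	int error;
};

/*
 * Hypothetical emulation: apply cmds[0..ncmds-1] in order, stop at the
 * first failure, and return the number of commands that succeeded.
 * Returns -1 with errno set only for invalid arguments to the call itself.
 */
static int epoll_ctl_batch_emul(int epfd, int flags,
				int ncmds, struct epoll_mod_cmd *cmds)
{
	int i;

	if (flags || ncmds < 0) {
		errno = EINVAL;
		return -1;
	}
	for (i = 0; i < ncmds; i++) {
		struct epoll_event ev = {
			.events = cmds[i].events,
			.data.u64 = cmds[i].data,
		};

		if (cmds[i].flags) {
			cmds[i].error = -EINVAL;
			break;
		}
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) < 0) {
			cmds[i].error = -errno;
			break;
		}
		cmds[i].error = 0;
	}
	return i;	/* cmds[i], if any, is the one that failed */
}
```

With a return value like this, the caller knows exactly which command
failed without having to pre-initialize anything.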

Q2. In the man page in 0/6 you said that the 'cmds' were not 
guaranteed to be executed in order. Why not? If you did provide
such a guarantee, then, when using your current epoll_mod_wait(),
user space could do the following:

1. Initialize the cmd.error fields to zero.
2. Call epoll_mod_wait().
3. Iterate through the cmd.error fields looking for the first
   nonzero field.
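
In code, the scan in the last step is trivial. This sketch is only
meaningful if the kernel guarantees in-order execution and the caller
zeroed the fields first; first_failed_cmd() is a hypothetical helper,
and the struct layout is the one from the 0/6 man page draft.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as given in the 0/6 man page draft (not an existing header). */
struct epoll_mod_cmd {
	int flags;
	int op;
	int fd;
	uint32_t events;
	uint64_t data;
	int error;
};

/*
 * Return the index of the first command whose error field is nonzero,
 * or n if every command succeeded.
 */
static size_t first_failed_cmd(const struct epoll_mod_cmd *cmds, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cmds[i].error != 0)
			break;
	return i;
}
```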

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. 

Yes, that change seemed inevitable. It slightly puzzled me at the time when
Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
though pselect() already had demonstrated the need for higher precision.
I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to write

			ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:
		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to
applications. In the existing epoll_pwait() call (as in ppoll()
and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from
a structure that hides this field, we're going to end up exposing
'size_t sigsetsize' to applications. (This could be avoided if we
split the API as I suggest above. glibc would do the same thing
in epoll_pwait1() that it currently does in epoll_pwait().)
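
For reference, this is roughly how a libc wrapper hides the argument
today. It is a sketch against the existing epoll_pwait(2) raw syscall,
not glibc's actual source; my_epoll_pwait() is a made-up name, and
using _NSIG / 8 for the kernel's signal-set size is my understanding of
the usual convention.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Size in bytes of the kernel's signal set (assumed: _NSIG bits). */
#define KERNEL_SIGSET_SIZE ((size_t)(_NSIG / 8))

/*
 * Wrapper with the five-argument signature applications see; the sixth
 * raw syscall argument, sigsetsize, is supplied here and never exposed.
 */
static int my_epoll_pwait(int epfd, struct epoll_event *events,
			  int maxevents, int timeout_ms,
			  const sigset_t *sigmask)
{
	return (int)syscall(SYS_epoll_pwait, epfd, events, maxevents,
			    timeout_ms, sigmask, KERNEL_SIGSET_SIZE);
}
```

Once sigsetsize lives inside a structure that the application fills
in, there is no equivalent place for the wrapper to supply it.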

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-20 22:40   ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2015-01-20 22:40 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett,
	Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz@redhat.com> wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
>
> NAME
>        epoll_mod_wait - modify and wait for I/O events on an epoll file
>                         descriptor
>
> SYNOPSIS
>
>        int epoll_mod_wait(int epfd, int flags,
>                           int ncmds, struct epoll_mod_cmd *cmds,
>                           struct epoll_wait_spec *spec);
>
> DESCRIPTION
>
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>
>        2) When you need higher precision than microsecond for wait timeout.
>
>        The epoll_ctl(2) operations are embedded into this call by with ncmds
>        and cmds. The latter is an array of command structs:
>
>            struct epoll_mod_cmd {
>
>                   /* Reserved flags for future extension, must be 0 for now. */
>                   int flags;
>
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
>
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
>
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
>
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
>
>                   /* Output field, will be set to the return code once this
>                    * command is executed by kernel */
>                   int error;
>            };

I would add an extra u32 at the end so that the structure size will be
a multiple of 8 bytes on all platforms.
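
Something along these lines, as a sketch only. The "pad" field name and
the switch to fixed-width types are my assumptions; the point is that
sizeof() no longer depends on how the ABI aligns a 64-bit member
(without the pad, the size would be 28 where uint64_t is 4-byte
aligned, e.g. i386, and 32 where it is 8-byte aligned).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical padded variant of the struct from the man page draft. */
struct epoll_mod_cmd_padded {
	int32_t  flags;
	int32_t  op;
	int32_t  fd;
	uint32_t events;
	uint64_t data;		/* starts at offset 16, naturally aligned */
	int32_t  error;
	uint32_t pad;		/* explicit tail padding */
};

/* Same layout whether uint64_t is 4- or 8-byte aligned. */
static_assert(sizeof(struct epoll_mod_cmd_padded) == 32,
	      "epoll_mod_cmd_padded must be 32 bytes");
static_assert(offsetof(struct epoll_mod_cmd_padded, data) == 16,
	      "data must sit at offset 16");
```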

>
>        There is no guartantee that all the commands are executed in order. Only
>        if all the commands are successfully executed (all the error fields are
>        set to 0), events are polled.

If this doesn't happen, what error is returned?

>            struct epoll_wait_spec {
>
>                   /* The same as "maxevents" in epoll_pwait() */
>                   int maxevents;
>
>                   /* The same as "events" in epoll_pwait() */
>                   struct epoll_event *events;
>
>                   /* Which clock to use for timeout */
>                   int clockid;
>
>                   /* Maximum time to wait if there is no event */
>                   struct timespec timeout;
>
>                   /* The same as "sigmask" in epoll_pwait() */
>                   sigset_t *sigmask;
>
>                   /* The same as "sigsetsize" in epoll_pwait() */
>                   size_t sigsetsize;
>            } EPOLL_PACKED;

I think the convention is to align the structure's fields manually
rather than declaring it to be packed.
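
One possible way to do that, purely as an illustration: order the
fields so that each one is naturally aligned and carry pointers and
sizes as 64-bit integers. The struct name, the field order, and the
choice to split the timeout into two 64-bit fields are all my
assumptions, not something the patch proposes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with no implicit padding and no packed attribute. */
struct epoll_wait_spec_aligned {
	int32_t  maxevents;
	int32_t  clockid;
	uint64_t events;	/* user pointer to struct epoll_event[] */
	int64_t  timeout_sec;
	int64_t  timeout_nsec;
	uint64_t sigmask;	/* user pointer to sigset_t, or 0 */
	uint64_t sigsetsize;
};

static_assert(sizeof(struct epoll_wait_spec_aligned) == 48,
	      "epoll_wait_spec_aligned must be 48 bytes");
static_assert(offsetof(struct epoll_wait_spec_aligned, events) == 8,
	      "events must sit at offset 8");
```

A layout like this would also be identical for 32-bit and 64-bit
callers.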

>
> RETURN VALUE
>
>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>        appropriately. All the "error" fields in cmds are unchanged before they
>        are executed, and if any cmds are executed, the "error" fields are set
>        to a return code accordingly. See also epoll_ctl for more details of the
>        return code.

Does this mean that callers should initialize the error fields to an
impossible value first so they can tell which commands were executed?

>
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout milliseconds.
>
>        If spec is NULL, it returns 0 if all the commands are successful, and -1
>        if an error occured.
>
> ERRORS
>
>        These errors apply on either the return value of epoll_mod_wait or error
>        status for each command, respectively.

Please clarify which errors are returned overall and which are per-command.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-20 22:40   ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2015-01-20 22:40 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro,
	Andrew Morton, Kees Cook, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra

On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> This adds a new system call, epoll_mod_wait. It's described as below:
>
> NAME
>        epoll_mod_wait - modify and wait for I/O events on an epoll file
>                         descriptor
>
> SYNOPSIS
>
>        int epoll_mod_wait(int epfd, int flags,
>                           int ncmds, struct epoll_mod_cmd *cmds,
>                           struct epoll_wait_spec *spec);
>
> DESCRIPTION
>
>        The epoll_mod_wait() system call can be seen as an enhanced combination
>        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
>        call. It is superior in two cases:
>
>        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
>        will save context switches between user mode and kernel mode;
>
>        2) When you need higher precision than microsecond for wait timeout.
>
>        The epoll_ctl(2) operations are embedded into this call with ncmds
>        and cmds. The latter is an array of command structs:
>
>            struct epoll_mod_cmd {
>
>                   /* Reserved flags for future extension, must be 0 for now. */
>                   int flags;
>
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
>
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
>
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
>
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
>
>                   /* Output field, will be set to the return code once this
>                    * command is executed by kernel */
>                   int error;
>            };

I would add an extra u32 at the end so that the structure size will be
a multiple of 8 bytes on all platforms.
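
A minimal sketch of that layout, assuming the field types from the quoted
struct. The struct name and the "pad" field name are hypothetical and not part
of the patch. On 32-bit x86 a uint64_t inside a struct is only 4-byte aligned,
so the quoted struct would be 28 bytes there but 32 bytes on x86-64; an
explicit trailing 32-bit field gives 32 bytes on both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the quoted struct epoll_mod_cmd with an explicit
 * trailing pad; "pad" is an illustrative name only. */
struct epoll_mod_cmd_padded {
	int      flags;   /* reserved, must be 0 */
	int      op;      /* as for epoll_ctl() */
	int      fd;      /* as for epoll_ctl() */
	uint32_t events;
	uint64_t data;    /* lands at offset 16, naturally aligned */
	int      error;   /* output: per-command return code */
	uint32_t pad;     /* explicit padding, expected to be 0 */
};

/* Compile-time checks: same size and offsets on 32-bit and 64-bit ABIs. */
static_assert(sizeof(struct epoll_mod_cmd_padded) == 32,
	      "size should be a multiple of 8 on all platforms");
static_assert(offsetof(struct epoll_mod_cmd_padded, data) == 16,
	      "data should be 8-byte aligned without implicit padding");
```

With the pad spelled out, a 32-bit compat layer would not need to translate
the struct, since no implicit padding differs between the ABIs.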

>
>        There is no guarantee that all the commands are executed in order. Only
>        if all the commands are successfully executed (all the error fields are
>        set to 0), events are polled.

If this doesn't happen, what error is returned?

>            struct epoll_wait_spec {
>
>                   /* The same as "maxevents" in epoll_pwait() */
>                   int maxevents;
>
>                   /* The same as "events" in epoll_pwait() */
>                   struct epoll_event *events;
>
>                   /* Which clock to use for timeout */
>                   int clockid;
>
>                   /* Maximum time to wait if there is no event */
>                   struct timespec timeout;
>
>                   /* The same as "sigmask" in epoll_pwait() */
>                   sigset_t *sigmask;
>
>                   /* The same as "sigsetsize" in epoll_pwait() */
>                   size_t sigsetsize;
>            } EPOLL_PACKED;

I think the convention is to align the structure's fields manually
rather than declaring it to be packed.

>
> RETURN VALUE
>
>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>        appropriately. All the "error" fields in cmds are unchanged before they
>        are executed, and if any cmds are executed, the "error" fields are set
>        to a return code accordingly. See also epoll_ctl for more details of the
>        return code.

Does this mean that callers should initialize the error fields to an
impossible value first so they can tell which commands were executed?

>
>        When successful, epoll_mod_wait() returns the number of file
>        descriptors ready for the requested I/O, or zero if no file descriptor
>        became ready during the requested timeout.
>
>        If spec is NULL, it returns 0 if all the commands are successful, and -1
>        if an error occurred.
>
> ERRORS
>
>        These errors apply on either the return value of epoll_mod_wait or error
>        status for each command, respectively.

Please clarify which errors are returned overall and which are per-command.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-20 23:03     ` josh-iaAMLnmF4UmaiuxdJuQwMA
  0 siblings, 0 replies; 80+ messages in thread
From: josh @ 2015-01-20 23:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Fam Zheng, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Kees Cook,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, Linux FS Devel, Linux API,
	Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, Jan 20, 2015 at 02:40:32PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz@redhat.com> wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> >
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> >
> > SYNOPSIS
> >
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> >
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >
> >        2) When you need higher precision than microsecond for wait timeout.
> >
> >        The epoll_ctl(2) operations are embedded into this call with ncmds
> >        and cmds. The latter is an array of command structs:
> >
> >            struct epoll_mod_cmd {
> >
> >                   /* Reserved flags for future extension, must be 0 for now. */
> >                   int flags;
> >
> >                   /* The same as epoll_ctl() op parameter. */
> >                   int op;
> >
> >                   /* The same as epoll_ctl() fd parameter. */
> >                   int fd;
> >
> >                   /* The same as the "events" field in struct epoll_event. */
> >                   uint32_t events;
> >
> >                   /* The same as the "data" field in struct epoll_event. */
> >                   uint64_t data;
> >
> >                   /* Output field, will be set to the return code once this
> >                    * command is executed by kernel */
> >                   int error;
> >            };
> 
> I would add an extra u32 at the end so that the structure size will be
> a multiple of 8 bytes on all platforms.

*shrug*, but if you do so, enforce that it has a value of 0 or return
-EINVAL, just like a flags field.  Alternatively, move the last field
earlier and make flags a uint64_t.

- Josh Triplett

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-20 12:50     ` Michael Kerrisk (man-pages)
  (?)
@ 2015-01-21  4:59       ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  4:59 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini

On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> > This syscall is a sequence of
> > 
> > 1) a number of epoll_ctl calls
> > 2) a epoll_pwait, with timeout enhancement.
> > 
> > The epoll_ctl operations are embedded so that application doesn't have to use
> > separate syscalls to insert/delete/update the fds before poll. It is more
> > efficient if the set of fds varies from one poll to another, which is the
> > common pattern for certain applications. 
> 
> Which applications? Could we have some specific examples? This is a 
> complex API, and it needs good justification.

OK, I'll explain more in v2.

> 
> > For example, depending on the input
> > buffer status, a data reading program may decide to temporarily stop polling an
> > fd.
> > 
> > Because the enablement of batching in this interface, even that regular
> > epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> > single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
>          ^^^^^^^^^^^^^^ should be epoll_mod_wait
> 
> I think you mean to say:
> 
>     The ability to batch multiple "epoll_ctl" operations into a single call
>     means that even when no wait events are requested (i.e., spec == NULL),
>     epoll_mod_wait() provides a performance optimization over using multiple
>     epoll_ctl() calls.
> 
> Right? If yes, please amend the commit message, and this text should
> also make its way into the revised man page under a heading "NOTES".

OK.

> 
> > The only complexity is returning the result of each operation.  For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that receives
> > the return code *iff* the command is executed (0 for success and -errno of the
> > equivalent epoll_ctl call), and will be left unchanged if the command is not
> > executed because of some earlier error, for example due to failure of
> > copy_from_user to copy the array.
> > 
> > Applications can utilize this fact to do error handling: they could initialize
> > all the epoll_mod_cmd.error fields to a positive value, which is by definition
> > not a possible output value from epoll_mod_wait. Then, when the syscall returns,
> > they know whether or not each command was executed by comparing its error with
> > the init value; if they differ, the field holds the result of the command.
> > More roughly, they can store any non-zero value and not distinguish "not run"
> > from failure.
> 
> That the 'cmds' are not executed in a specified order, plus the need to
> initialize the 'error' fields to a positive value, feels a bit ugly.
> And indeed the whole "command list was only partially run" case
> is not pretty. Am I correct to understand that if an error is found
> during execution of one of the "epoll_ctl" commands in 'cmds' then
> the system call will return -1 with errno set, indicating an error,
> even though the epoll interest list may have changed because some
> of the earlier 'cmds' executed successfully? This all seems a bit of
> a headache for user space.

This is the trade-off for batching. The best we can do is probably to make this
transactional: either all of the commands succeed or none do. That would require
a much more complex implementation, though, and even then, reporting which
command failed remains a complication.

> 
> I have a couple of questions:
> 
> Q1. I can see that batching "epoll_ctl" commands might be useful,
> since it results in fewer system calls. But, does it really
> need to be bound together with the "epoll_pwait" functionality?
> (Perhaps this point was covered in previous discussions, but
> neither the message accompanying this patch nor the 0/6 man page
> provide a compelling rationale for the need to bind these two
> operations together.)
> 
> Yes, I realize you might save a system call, but it makes for a
> cumbersome API that has the above headache, and also forces the 
> need for double pointer indirection in the 'spec' argument (i.e., 
> spec is a pointer to an array of structures where each element
> in turn includes an 'events' pointer that points to another array).
> 
> Why not a simpler API with two syscalls such as:
> 
> epoll_ctl_batch(int epfd, int flags,
>                 int ncmds, struct epoll_mod_cmd *cmds);
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>              struct timespec *timeout, int clock_id, 
>              const sigset_t *sigmask, size_t sigsetsize);

The problem is that epoll_pwait1 has no room for a flags field, which was
asked for in the previous discussion thread [1].

I don't see epoll_mod_wait as a *significantly more* complicated interface
compared to epoll_ctl_batch and epoll_pwait1 above. In epoll_mod_wait, if you
leave out ncmds and cmds, it is effectively a poll without batch; and if you
leave out spec, it is effectively a batch without poll.
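
To make the two degenerate cases concrete, here is a rough user-space
emulation built only on the existing epoll_ctl(2) and epoll_wait(2). It is a
sketch under assumptions, not the proposed implementation: the "sketch_" names
are hypothetical, the sigmask and clockid handling is omitted, and the
timespec is rounded up to milliseconds, which is exactly the precision loss
the new call is meant to avoid.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <time.h>

/* User-space mirror of the quoted command struct (not a real ABI). */
struct sketch_mod_cmd {
	int      flags;
	int      op;
	int      fd;
	uint32_t events;
	uint64_t data;
	int      error;   /* output: 0 or -errno once the command has run */
};

/* Reduced wait spec: sigmask and clockid are left out of this sketch. */
struct sketch_wait_spec {
	int maxevents;
	struct epoll_event *events;
	struct timespec timeout;
};

/* Run the commands in order, stop at the first failure, then optionally
 * wait. ncmds == 0 gives "poll without batch"; spec == NULL gives
 * "batch without poll". */
static int sketch_mod_wait(int epfd, int ncmds, struct sketch_mod_cmd *cmds,
			   const struct sketch_wait_spec *spec)
{
	long long ms;
	int i;

	assert(ncmds == 0 || cmds != NULL);
	for (i = 0; i < ncmds; i++) {
		struct epoll_event ev = { .events = cmds[i].events };

		ev.data.u64 = cmds[i].data;
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) < 0) {
			cmds[i].error = -errno;
			return -1;   /* errno is still the one set by epoll_ctl */
		}
		cmds[i].error = 0;
	}
	if (!spec)
		return 0;

	/* Round the timespec up to whole milliseconds and clamp to int. */
	ms = (long long)spec->timeout.tv_sec * 1000 +
	     (spec->timeout.tv_nsec + 999999) / 1000000;
	if (ms > INT_MAX)
		ms = INT_MAX;
	return epoll_wait(epfd, spec->events, spec->maxevents, (int)ms);
}
```

Commands after the first failure keep whatever value the caller stored in
their error field, which matches the "left unchanged if not executed"
behaviour described in the patch.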

The most important change here is the timeout. IMO I wouldn't mind leaving out
batching. Integrating it was requested by Andy:

[1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591

which also made sense to me; I do have a patch in QEMU that calls epoll_ctl a
number of times right before epoll_wait.

[Sorry for not putting anything into cover letter changelog, but it is also
interesting to see people's reaction on the patch itself without bias of
others' opinions. This indeed brings in more points. :]

> 
> This gives us much of the benefit of reducing system calls, but 
> with greater simplicity. And epoll_ctl_batch() could simply return
> the number of 'cmds' that were successfully executed.
> 
> Q2. In the man page in 0/6 you said that the 'cmds' were not 
> guaranteed to be executed in order. Why not? If you did provide
> such a guarantee, then, when using your current epoll_mod_wait(),
> user space could do the following:

I guess we can make a guarantee on that.

> 
> 1. Initialize the cmd.error fields to zero.
> 2. Call epoll_mod_wait()
> 3. Iterate through the cmd.error fields looking for the first nonzero
>    field.

It's close, but zero is not good enough if copy_from_user of cmds failed in
the first place. An impossible value, or an error value, would be safer.
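
A rough sketch of what that looks like on the caller's side, assuming the
per-command result is always 0 or a negative errno as described in the patch.
The sentinel value and the helper names are hypothetical; only the "error"
field comes from the proposed struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel: positive, so it can never be a per-command
 * result, which is 0 or a negative errno. */
#define SENTINEL_NOT_RUN 1

/* Only the output field matters for this sketch. */
struct sentinel_cmd {
	int error;
};

/* Before the syscall: mark every command as "not run". */
static void sentinel_prepare(struct sentinel_cmd *cmds, size_t n)
{
	size_t i;

	assert(cmds != NULL || n == 0);
	for (i = 0; i < n; i++)
		cmds[i].error = SENTINEL_NOT_RUN;
}

/* After the syscall: how many commands were actually executed. */
static size_t sentinel_executed(const struct sentinel_cmd *cmds, size_t n)
{
	size_t i, done = 0;

	for (i = 0; i < n; i++)
		if (cmds[i].error != SENTINEL_NOT_RUN)
			done++;
	return done;
}

/* After the syscall: index of the first executed command that failed,
 * or n if no executed command failed. */
static size_t sentinel_first_failure(const struct sentinel_cmd *cmds, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cmds[i].error < 0)
			return i;
	return n;
}
```

If the call fails and sentinel_executed() reports 0, nothing was run (for
example the copy of cmds itself failed), a case that zero-initialized fields
cannot tell apart from "every command succeeded".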

> 
> > Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> > scalar. This provides higher precision. 
> 
> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> though pselect() already had demonstrated the need for higher precision.
> I should have called it out way back then :-{.
> 
> > The parameter field in struct
> > epoll_wait_spec, "clockid", also makes it possible for users to use a different
> > clock than the default when it makes more sense.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/syscalls.h |  5 ++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >  			      sigmask ? &ksigmask : NULL);
> >  }
> >  
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > +		struct epoll_wait_spec __user *, spec)
> > +{
> > +	struct epoll_mod_cmd *kcmds = NULL;
> > +	int i, ret = 0;
> > +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +	if (ncmds) {
> > +		if (!cmds)
> > +			return -EINVAL;
> > +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > +		if (!kcmds)
> > +			return -ENOMEM;
> > +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +	}
> > +	for (i = 0; i < ncmds; i++) {
> > +		struct epoll_event ev = (struct epoll_event) {
> > +			.events = kcmds[i].events,
> > +			.data = kcmds[i].data,
> > +		};
> > +		if (kcmds[i].flags) {
> > +			kcmds[i].error = ret = -EINVAL;
> 
> To make the 'ret' change a little more obvious, maybe it's better to write
> 
> 			ret = kcmds[i].error = -EINVAL;
> 
> > +			goto out;
> > +		}
> > +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> 
> Likewise:
> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> 
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	if (spec) {
> > +		sigset_t ksigmask;
> > +		struct epoll_wait_spec kspec;
> > +		ktime_t timeout;
> > +
> > +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> 
> Cosmetic point: s/if(/if (/
> 
> > +			return -EFAULT;
> > +		if (kspec.sigmask) {
> > +			if (kspec.sigsetsize != sizeof(sigset_t))
> > +				return -EINVAL;
> > +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > +				return -EFAULT;
> > +		}
> > +		timeout = timespec_to_ktime(kspec.timeout);
> > +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > +				     kspec.clockid, timeout,
> > +				     kspec.sigmask ? &ksigmask : NULL);
> 
> If I understand correctly, the implementation means that the
> 'size_t sigsetsize' field will probably need to be exposed to 
> applications. In the existing epoll_pwait() call (as in  ppoll()
> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> However, unless we expect glibc to do some structure copying to/from
> a structure that hides this field, then we're going to end up exposing
> 'size_t sigsetsize' to applications. (This could be avoided, if we
> split the API as I suggest above. glibc would do the same thing 
> in epoll_pwait1() that it currently does in epoll_pwait().)
> 
> Thanks,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21  4:59       ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  4:59 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra

On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> > This syscall is a sequence of
> > 
> > 1) a number of epoll_ctl calls
> > 2) a epoll_pwait, with timeout enhancement.
> > 
> > The epoll_ctl operations are embedded so that application doesn't have to use
> > separate syscalls to insert/delete/update the fds before poll. It is more
> > efficient if the set of fds varies from one poll to another, which is the
> > common pattern for certain applications. 
> 
> Which applications? Could we have some specific examples? This is a 
> complex API, and it needs good justification.

OK, I'll explain more in v2.

> 
> > For example, depending on the input
> > buffer status, a data reading program may decide to temporarily stop polling an
> > fd.
> > 
> > Because the enablement of batching in this interface, even that regular
> > epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> > single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
>          ^^^^^^^^^^^^^^ should be epoll_mod_wait
> 
> I think you mean to say:
> 
>     The ability to batch multiple "epoll_ctl" operations into a single call
>     means that even when no wait events are requested (i.e., spec == NULL),
>     epoll_mod_wait() provides a performance optimization over using multiple
>     epoll_ctl() calls.
> 
> Right? If yes, please amend the commit message, and this text should
> also make its way into the revised man page under a heading "NOTES".

OK.

> 
> > The only complexity is returning the result of each operation.  For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that receives
> > the return code *iff* the command is executed (0 for success and -errno of the
> > equivalent epoll_ctl call), and will be left unchanged if the command is not
> > executed because of some earlier error, for example due to failure of
> > copy_from_user to copy the array.
> > 
> > Applications can utilize this fact to do error handling: they could initialize
> > all the epoll_mod_cmd.error fields to a positive value, which is by definition
> > not a possible output value from epoll_mod_wait. Then, when the syscall returns,
> > they know whether or not each command was executed by comparing its error with
> > the init value; if they differ, the field holds the result of the command.
> > More roughly, they can store any non-zero value and not distinguish "not run"
> > from failure.
> 
> That the 'cmds' are not executed in a specified order, plus the need to
> initialize the 'error' fields to a positive value, feels a bit ugly.
> And indeed the whole "command list was only partially run" case
> is not pretty. Am I correct to understand that if an error is found
> during execution of one of the "epoll_ctl" commands in 'cmds' then
> the system call will return -1 with errno set, indicating an error,
> even though the epoll interest list may have changed because some
> of the earlier 'cmds' executed successfully? This all seems a bit of
> a headache for user space.

This is the trade-off for batching. The best we can do is probably to make this
transactional: either all of the commands succeed or none do. That would require
a much more complex implementation, though, and even then, reporting which
command failed remains a complication.

> 
> I have a couple of questions:
> 
> Q1. I can see that batching "epoll_ctl" commands might be useful,
> since it results in fewer system calls. But, does it really
> need to be bound together with the "epoll_pwait" functionality?
> (Perhaps this point was covered in previous discussions, but
> neither the message accompanying this patch nor the 0/6 man page
> provide a compelling rationale for the need to bind these two
> operations together.)
> 
> Yes, I realize you might save a system call, but it makes for a
> cumbersome API that has the above headache, and also forces the 
> need for double pointer indirection in the 'spec' argument (i.e., 
> spec is a pointer to an array of structures where each element
> in turn includes an 'events' pointer that points to another array).
> 
> Why not a simpler API with two syscalls such as:
> 
> epoll_ctl_batch(int epfd, int flags,
>                 int ncmds, struct epoll_mod_cmd *cmds);
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>              struct timespec *timeout, int clock_id, 
>              const sigset_t *sigmask, size_t sigsetsize);

The problem is that epoll_pwait1 has no room for a flags field, which was
asked for in the previous discussion thread [1].

I don't see epoll_mod_wait as a *significantly more* complicated interface
compared to epoll_ctl_batch and epoll_pwait1 above. In epoll_mod_wait, if you
leave out ncmds and cmds, it is effectively a poll without batch; and if you
leave out spec, it is effectively a batch without poll.

The most important change here is the timeout. IMO I wouldn't mind leaving out
batching. Integrating it was requested by Andy:

[1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591

which also made sense to me; I do have a patch in QEMU that calls epoll_ctl a
number of times right before epoll_wait.

[Sorry for not putting anything into cover letter changelog, but it is also
interesting to see people's reaction on the patch itself without bias of
others' opinions. This indeed brings in more points. :]

> 
> This gives us much of the benefit of reducing system calls, but 
> with greater simplicity. And epoll_ctl_batch() could simply return
> the number of 'cmds' that were successfully executed.
> 
> Q2. In the man page in 0/6 you said that the 'cmds' were not 
> guaranteed to be executed in order. Why not? If you did provide
> such a guarantee, then, when using your current epoll_mod_wait(),
> user space could do the following:

I guess we can make a guarantee on that.

> 
> 1. Initialize the cmd.error fields to zero.
> 2. Call epoll_mod_wait()
> 3. Iterate through the cmd.error fields looking for the first nonzero
>    field.

It's close, but zero is not good enough if copy_from_user of cmds failed in
the first place. An impossible value, or an error value, would be safer.

> 
> > Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> > scalar. This provides higher precision. 
> 
> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> though pselect() already had demonstrated the need for higher precision.
> I should have called it out way back then :-{.
> 
> > The parameter field in struct
> > epoll_wait_spec, "clockid", also makes it possible for users to use a different
> > clock than the default when it makes more sense.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/syscalls.h |  5 ++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >  			      sigmask ? &ksigmask : NULL);
> >  }
> >  
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > +		struct epoll_wait_spec __user *, spec)
> > +{
> > +	struct epoll_mod_cmd *kcmds = NULL;
> > +	int i, ret = 0;
> > +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +	if (ncmds) {
> > +		if (!cmds)
> > +			return -EINVAL;
> > +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > +		if (!kcmds)
> > +			return -ENOMEM;
> > +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +	}
> > +	for (i = 0; i < ncmds; i++) {
> > +		struct epoll_event ev = (struct epoll_event) {
> > +			.events = kcmds[i].events,
> > +			.data = kcmds[i].data,
> > +		};
> > +		if (kcmds[i].flags) {
> > +			kcmds[i].error = ret = -EINVAL;
> 
> To make the 'ret' change a little more obvious, maybe it's better to write
> 
> 			ret = kcmds[i].error = -EINVAL;
> 
> > +			goto out;
> > +		}
> > +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> 
> Likewise:
> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> 
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	if (spec) {
> > +		sigset_t ksigmask;
> > +		struct epoll_wait_spec kspec;
> > +		ktime_t timeout;
> > +
> > +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> 
> Cosmetic point: s/if(/if (/
> 
> > +			return -EFAULT;
> > +		if (kspec.sigmask) {
> > +			if (kspec.sigsetsize != sizeof(sigset_t))
> > +				return -EINVAL;
> > +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > +				return -EFAULT;
> > +		}
> > +		timeout = timespec_to_ktime(kspec.timeout);
> > +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > +				     kspec.clockid, timeout,
> > +				     kspec.sigmask ? &ksigmask : NULL);
> 
> If I understand correctly, the implementation means that the
> 'size_t sigsetsize' field will probably need to be exposed to 
> applications. In the existing epoll_pwait() call (as in  ppoll()
> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> However, unless we expect glibc to do some structure copying to/from
> a structure that hides this field, then we're going to end up exposing
> 'size_t sigsetsize' to applications. (This could be avoided, if we
> split the API as I suggest above. glibc would do the same thing 
> in epoll_pwait1() that it currently does in epoll_pwait().)
> 
> Thanks,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-21  5:55     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 80+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-21  5:55 UTC (permalink / raw)
  To: Andy Lutomirski, Fam Zheng
  Cc: mtk.manpages, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86 ML, Alexander Viro, Andrew Morton, Kees Cook,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, Linux FS Devel, Linux API,
	Josh Triplett, Paolo Bonzini

On 01/20/2015 11:40 PM, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz@redhat.com> wrote:
>> This adds a new system call, epoll_mod_wait. It's described as below:

[...]

>>        There is no guarantee that all the commands are executed in order. Only
>>        if all the commands are successfully executed (all the error fields are
>>        set to 0), events are polled.
> 
> If this doesn't happen, what error is returned?

If I read the code correctly: the error of the first epoll_ctl op that fails.

[...]

>> RETURN VALUE
>>
>>        When any error occurs, epoll_mod_wait() returns -1 and errno is set
>>        appropriately. All the "error" fields in cmds are unchanged before they
>>        are executed, and if any cmds are executed, the "error" fields are set
>>        to a return code accordingly. See also epoll_ctl for more details of the
>>        return code.
> 
> Does this mean that callers should initialize the error fields to an
> impossible value first so they can tell which commands were executed?

Yes. (Ugly!)

[...]

>> ERRORS
>>
>>        These errors apply on either the return value of epoll_mod_wait or error
>>        status for each command, respectively.
> 
> Please clarify which errors are returned overall and which are per-command.

Yes, I think this would be valuable as well.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21  7:52         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 80+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-21  7:52 UTC (permalink / raw)
  To: Fam Zheng
  Cc: mtk.manpages, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini

Hello Fam Zheng,

On 01/21/2015 05:59 AM, Fam Zheng wrote:
> On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
>> Hello Fam Zheng,
>>
>> On 01/20/2015 10:57 AM, Fam Zheng wrote:
>>> This syscall is a sequence of
>>>
>>> 1) a number of epoll_ctl calls
>>> 2) an epoll_pwait, with timeout enhancement.
>>>
>>> The epoll_ctl operations are embedded so that application doesn't have to use
>>> separate syscalls to insert/delete/update the fds before poll. It is more
>>> efficient if the set of fds varies from one poll to another, which is the
>>> common pattern for certain applications. 
>>
>> Which applications? Could we have some specific examples? This is a 
>> complex API, and it needs good justification.
> 
> OK, I'll explain more in v2.

Okay.

[...]

>>> The only complexity is returning the result of each operation.  For each
>>> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
>>> the return code *iff* the command is executed (0 for success and -errno of the
>>> equivalent epoll_ctl call), and will be left unchanged if the command is not
>>> executed because some earlier error, for example due to failure of
>>> copy_from_user to copy the array.
>>>
>>> Applications can utilize this fact to do error handling: they could initialize
>>> all the epoll_mod_wait.error to a positive value, which is by definition not a
>>> possible output value from epoll_mod_wait. Then when the syscall returned, they
>>> know whether or not the command is executed by comparing each error with the
>>> init value, if they're different, they have the result of the command.
>>> More roughly, they can put any non-zero and not distinguish "not run" from
>>> failure.
>>
>> The "cmds' are not executed in a specified order plus the need to
>> initialize the 'errors' fields to a positive value feels a bit ugly.
>> And indeed the whole "command list was only partially run" case
>> is not pretty. Am I correct to understand that if an error is found
>> during execution of one of the "epoll_ctl" commands in 'cmds' then
>> the system call will return -1 with errno set, indicating an error,
>> even though the epoll interest list may have changed because some
>> of the earlier 'cmds' executed successfully? This all seems a bit of
>> a headache for user space.
> 
> This is the trade-off for batching. The best we can do is probably make this
> transactional: none or all of the commands succeeds. It will require a much
> more complex implementation, though. 

Transactional would be more comfortable for user space, and while I
can see that it would be complex to implement, perhaps the bigger
concern is that the implementation would be CPU-expensive.

> But even with that, the error reporting on
> which command failed is a complication.

My suggestions below could make the error reporting much simpler...

>> I have a couple of questions:
>>
>> Q1. I can see that batching "epoll_ctl" commands might be useful,
>> since it results in fewer systems calls. But, does it really
>> need to be bound together with the "epoll_pwait" functionality?
>> (Perhaps this point was covered in previous discussions, but
>> neither the message accompanying this patch nor the 0/6 man page
>> provide a compelling rationale for the need to bind these two
>> operations together.)
>>
>> Yes, I realize you might save a system call, but it makes for a
>> cumbersome API that has the above headache, and also forces the 
>> need for double pointer indirection in the 'spec' argument (i.e., 
>> spec is a pointer to an array of structures where each element
>> in turn includes an 'events' pointer that points to another array).
>>
>> Why not a simpler API with two syscalls such as:
>>
>> epoll_ctl_batch(int epfd, int flags,
>>                 int ncmds, struct epoll_mod_cmd *cmds);
>>
>> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>>              struct timespec *timeout, int clock_id, 
>>              const sigset_t *sigmask, size_t sigsetsize);
> 
> The problem is that there is no room for flags field in epoll_pwait1, which is
> asked for, in previous discussion thread [1].

Ahh yes, I certainly should not have forgotten that. But that's easily solved.
Do as for pselect6():

struct sigargs {
    const sigset_t *ss;
    size_t          ss_len; /* Size (in bytes) of object pointed
                               to by 'ss' */
};

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents,
             struct timespec *timeout, int clock_id, int flags,
             const struct sigargs *sargs);

> I don't see epoll_mod_wait as a *significantly more* complicated interface
> compared to epoll_ctl_batch and epoll_pwait1 above. 

My biggest problem with epoll_mod_wait() is the complexity of error
handling. epoll_ctl_batch and epoll_pwait1 would simplify that, as I
note below.

Aside from that, I do think that epoll_mod_wait() passes a certain
threshold of complexity that warrants good justification, which
I haven't really seen yet.

> In epoll_mod_wait, if you
> leave out ncmds and cmds, it is effectively a poll without batch; and if
> leaving out spec, it is effectively a batch without poll.
> 
> The most important change here is the timeout. IMO I wouldn't mind leaving out
> batching. Integrating it is requested by Andy:
> 
> [1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
> 
> which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
> number of times right before epoll_wait.
> 
> [Sorry for not putting anything into cover letter changelog, but it is also
> interesting to see people's reaction on the patch itself without bias of
> others' opinions. This indeed brings in more points. :]

But it also has the downside that the same discussions
may be repeated.

>> This gives us much of the benefit of reducing system calls, but 
>> with greater simplicity. And epoll_ctl_batch() could simply return
>> the number of 'cmds' that were successfully executed.)
>>
>> Q2. In the man page in 0/6 you said that the 'cmds' were not 
>> guaranteed to be executed in order. Why not? If you did provide
>> such a guarantee, then, when using your current epoll_mod_wait(),
>> user space could do the following:
> 
> I guess we can make a guarentee on that.

I'm puzzled by that response. Surely you *must* guarantee it (unless,
of course, you want to explicitly specify that using the same FD
multiple times in the batch gives undefined behavior). If there's no
defined order and the batch includes multiple commands that operate
on the same FD, the result is undefined.
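
Why the order matters for two commands on the same FD can be seen with plain
epoll_ctl(2) as it exists today. The following is a minimal sketch;
add_del_same_fd() is a hypothetical helper written for this illustration, and
it probes the final state with EPOLL_CTL_MOD, which only succeeds on a
registered FD.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Apply ADD and DEL to the same fd, in the order selected by add_first.
 * Returns the number of operations that succeeded (or -1 on setup failure)
 * and reports whether the fd is still registered afterwards. */
static int add_del_same_fd(int add_first, int *still_registered)
{
	int epfd = epoll_create1(0);
	int fd = eventfd(0, 0);
	struct epoll_event ev = { .events = EPOLLIN };
	int ops[2];
	int ok = 0;

	assert(still_registered != NULL);
	if (epfd < 0 || fd < 0) {
		if (fd >= 0)
			close(fd);
		if (epfd >= 0)
			close(epfd);
		return -1;
	}

	ops[0] = add_first ? EPOLL_CTL_ADD : EPOLL_CTL_DEL;
	ops[1] = add_first ? EPOLL_CTL_DEL : EPOLL_CTL_ADD;
	for (int i = 0; i < 2; i++)
		if (epoll_ctl(epfd, ops[i], fd, &ev) == 0)
			ok++;

	/* MOD fails with ENOENT unless the fd is in the interest list. */
	*still_registered = (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == 0);

	close(fd);
	close(epfd);
	return ok;
}
```

ADD followed by DEL succeeds twice and leaves the FD unregistered; DEL
followed by ADD fails once (ENOENT) and leaves it registered. The same two
commands therefore give different results and different end states depending
on the order in which they run.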

>> 1. Initialize the cmd.errors fields to zero.
>> 2. Call epoll_ctl_mod()
>> 3. Iterate through cmd.errors looking for the first nonzero 
>>    field.
> 
> It's close, but zero is not good enough, if copy_from_user of cmds failed in
> the first place. Impossible value, or error value, will be safer.

See my comment in the earlier mail. If you split this into two 
APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
then the return value of epoll_ctl_batch() could be used to tell
user space how many commands succeeded. Much simpler!
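
The suggested return-value semantics can be emulated in user space with
today's epoll_ctl(2), which may help to picture them. This is a sketch under
assumptions: struct batch_cmd is a reduced stand-in for struct epoll_mod_cmd,
and ctl_batch_emulated() and demo_partial_batch() are hypothetical names; the
real epoll_ctl_batch() would of course do the loop inside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Reduced stand-in for the proposed command struct (assumption). */
struct batch_cmd {
	int op;
	int fd;
	struct epoll_event ev;
};

/* Run the commands strictly in order and stop at the first failure.
 * Returns how many commands succeeded; on a short count, errno is the
 * error of command number <return value>. */
static int ctl_batch_emulated(int epfd, struct batch_cmd *cmds, int ncmds)
{
	int i;

	assert(cmds != NULL || ncmds == 0);
	for (i = 0; i < ncmds; i++)
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &cmds[i].ev) < 0)
			break;
	return i;
}

/* Demo: ADD, duplicate ADD (fails with EEXIST), DEL (never reached). */
static int demo_partial_batch(void)
{
	int epfd = epoll_create1(0);
	int fd = eventfd(0, 0);
	struct batch_cmd cmds[3] = {
		{ EPOLL_CTL_ADD, fd, { .events = EPOLLIN } },
		{ EPOLL_CTL_ADD, fd, { .events = EPOLLIN } },
		{ EPOLL_CTL_DEL, fd, { .events = 0 } },
	};
	int done = ctl_batch_emulated(epfd, cmds, 3);
	int saved_errno = errno;

	close(fd);
	close(epfd);
	errno = saved_errno;	/* keep the error of the failing command */
	return done;
}
```

The caller needs no sentinel values here: a return value smaller than ncmds
identifies the failing command by index, and everything after it is known not
to have run.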

>>> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
>>> scalar. This provides higher precision. 
>>
>> Yes, that change seemed inevitable. It slightly puzzled me at the time when
>> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
>> though pselect() already had demonstrated the need for higher precision.
>> I should have called it out way back then :-{.
>>
>>> The parameter field in struct
>>> epoll_wait_spec, "clockid", also makes it possible for users to use a different
>>> clock than the default when it makes more sense.
>>>
>>> Signed-off-by: Fam Zheng <famz@redhat.com>
>>> ---
>>>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/syscalls.h |  5 ++++
>>>  2 files changed, 65 insertions(+)
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index e7a116d..2cc22c9 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>>>  			      sigmask ? &ksigmask : NULL);
>>>  }
>>>  
>>> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
>>> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
>>> +		struct epoll_wait_spec __user *, spec)
>>> +{
>>> +	struct epoll_mod_cmd *kcmds = NULL;
>>> +	int i, ret = 0;
>>> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
>>> +
>>> +	if (flags)
>>> +		return -EINVAL;
>>> +	if (ncmds) {
>>> +		if (!cmds)
>>> +			return -EINVAL;
>>> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
>>> +		if (!kcmds)
>>> +			return -ENOMEM;
>>> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
>>> +			ret = -EFAULT;
>>> +			goto out;
>>> +		}
>>> +	}
>>> +	for (i = 0; i < ncmds; i++) {
>>> +		struct epoll_event ev = (struct epoll_event) {
>>> +			.events = kcmds[i].events,
>>> +			.data = kcmds[i].data,
>>> +		};
>>> +		if (kcmds[i].flags) {
>>> +			kcmds[i].error = ret = -EINVAL;
>>
>> To make the 'ret' change a little more obvious, maybe it's better to write
>>
>> 			ret = kcmds[i].error = -EINVAL;
>>
>>> +			goto out;
>>> +		}
>>> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>>
>> Likewise:
>> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>>
>>> +		if (ret)
>>> +			goto out;
>>> +	}
>>> +	if (spec) {
>>> +		sigset_t ksigmask;
>>> +		struct epoll_wait_spec kspec;
>>> +		ktime_t timeout;
>>> +
>>> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
>>
>> Cosmetic point: s/if(/if (/
>>
>>> +			return -EFAULT;
>>> +		if (kspec.sigmask) {
>>> +			if (kspec.sigsetsize != sizeof(sigset_t))
>>> +				return -EINVAL;
>>> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
>>> +				return -EFAULT;
>>> +		}
>>> +		timeout = timespec_to_ktime(kspec.timeout);
>>> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
>>> +				     kspec.clockid, timeout,
>>> +				     kspec.sigmask ? &ksigmask : NULL);
>>
>> If I understand correctly, the implementation means that the
>> 'size_t sigsetsize' field will probably need to be exposed to 
>> applications. In the existing epoll_pwait() call (as in  ppoll()
>> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
>> However, unless we expect glibc to do some structure copying to/from
>> a structure that hides this field, then we're going end up exposing
>> 'size_t sigsetsize' to applications. (This could be avoided, if we
>> split the API as I suggest above. glibc would do the same thing 
>> in epoll_pwait1() that it currently does in epoll_pwait().)

You missed responding to this point; I think it matters.
(There were also some other points to consider in my reply 
to your 0/6 mail.)

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21  7:52         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 80+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-21  7:52 UTC (permalink / raw)
  To: Fam Zheng
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers

Hello Fam Zheng,

On 01/21/2015 05:59 AM, Fam Zheng wrote:
> On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
>> Hello Fam Zheng,
>>
>> On 01/20/2015 10:57 AM, Fam Zheng wrote:
>>> This syscall is a sequence of
>>>
>>> 1) a number of epoll_ctl calls
>>> 2) a epoll_pwait, with timeout enhancement.
>>>
>>> The epoll_ctl operations are embeded so that application doesn't have to use
>>> separate syscalls to insert/delete/update the fds before poll. It is more
>>> efficient if the set of fds varies from one poll to another, which is the
>>> common pattern for certain applications. 
>>
>> Which applications? Could we have some specific examples? This is a 
>> complex API, and it needs good justification.
> 
> OK, I'll explain more in v2.

Okay.

[...]

>>> The only complexity is returning the result of each operation.  For each
>>> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
>>> the return code *iff* the command is executed (0 for success and -errno of the
>>> equivalent epoll_ctl call), and will be left unchanged if the command is not
>>> executed because some earlier error, for example due to failure of
>>> copy_from_user to copy the array.
>>>
>>> Applications can utilize this fact to do error handling: they could initialize
>>> all the epoll_mod_wait.error to a positive value, which is by definition not a
>>> possible output value from epoll_mod_wait. Then when the syscall returned, they
>>> know whether or not the command is executed by comparing each error with the
>>> init value, if they're different, they have the result of the command.
>>> More roughly, they can put any non-zero and not distinguish "not run" from
>>> failure.
>>
>> The "cmds' are not executed in a specified order plus the need to
>> initialize the 'errors' fields to a positive value feels a bit ugly.
>> And indeed the whole "command list was only partially run" case
>> is not pretty. Am I correct to understand that if an error is found
>> during execution of one of the "epoll_ctl" commands in 'cmds' then
>> the system call will return -1 with errno set, indicating an error,
>> even though the epoll interest list may have changed because some
>> of the earlier 'cmds' executed successfully? This all seems a bit of
>> a headache for user space.
> 
> This is the trade-off for batching. The best we can do is probably make this
> transactional: none or all of the commands succeeds. It will require a much
> more complex implementation, though. 

Transactional would be more comfortable for user-space, and while I 
can see that it would be complex to implement, perhaps the greater
concern is that such an implementation would be CPU-expensive.

> But even with that, the error reporting on
> which command failed is a complication.

My suggestions below could make the error reporting much simpler...

>> I have a couple of questions:
>>
>> Q1. I can see that batching "epoll_ctl" commands might be useful,
>> since it results in fewer systems calls. But, does it really
>> need to be bound together with the "epoll_pwait" functionality?
>> (Perhaps this point was covered in previous discussions, but
>> neither the message accompanying this patch nor the 0/6 man page
>> provide a compelling rationale for the need to bind these two
>> operations together.)
>>
>> Yes, I realize you might save a system call, but it makes for a
>> cumbersome API that has the above headache, and also forces the 
>> need for double pointer indirection in the 'spec' argument (i.e., 
>> spec is a pointer to an array of structures where each element
>> in turn includes an 'events' pointer that points to another array).
>>
>> Why not a simpler API with two syscalls such as:
>>
>> epoll_ctl_batch(int epfd, int flags,
>>                 int ncmds, struct epoll_mod_cmd *cmds);
>>
>> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
>>              struct timespec *timeout, int clock_id, 
>>              const sigset_t *sigmask, size_t sigsetsize);
> 
> The problem is that there is no room for flags field in epoll_pwait1, which is
> asked for, in previous discussion thread [1].

Ahh yes, I certainly should not have forgotten that. But that's easily solved.
Do as for pselect6():

struct sigargs {
    const sigset_t *ss;
    size_t          ss_len; /* Size (in bytes) of object pointed
                               to by 'ss' */
};

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, int flags,
             const struct sigargs *sargs);

> I don't see epoll_mod_wait as a *significantly more* complicated interface
> compared to epoll_ctl_batch and epoll_pwait1 above. 

My biggest problem with epoll_mod_wait() is the complexity of error 
handling. epoll_ctl_batch and epoll_pwait1 would simplify that, as I 
note below.

Aside from that, I do think that epoll_mod_wait() passes a certain
threshold of complexity that warrants good justification, which
I haven't really seen yet.

> In epoll_mod_wait, if you
> leave out ncmds and cmds, it is effectively a poll without batch; and if
> leaving out spec, it is effectively a batch without poll.
> 
> The most important change here is the timeout. IMO I wouldn't mind leaving out
> batching. Integrating it is requested by Andy:
> 
> [1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
> 
> which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
> number of times right before epoll_wait.
> 
> [Sorry for not putting anything into cover letter changelog, but it is also
> interesting to see people's reaction on the patch itself without bias of
> others' opinions. This indeed brings in more points. :]

But it also has the downside that the same discussions
may be repeated.

>> This gives us much of the benefit of reducing system calls, but 
>> with greater simplicity. And epoll_ctl_batch() could simply return
>> the number of 'cmds' that were successfully executed.)
>>
>> Q2. In the man page in 0/6 you said that the 'cmds' were not 
>> guaranteed to be executed in order. Why not? If you did provide
>> such a guarantee, then, when using your current epoll_mod_wait(),
>> user space could do the following:
> 
> I guess we can make a guarentee on that.

I'm puzzled by that response. Surely you *must* guarantee it.
If there's no defined order, then if the batch includes multiple 
commands that operate on the same FD, the result is undefined 
unless you provide that guarantee. (Unless, of course, you want to
explicitly specify that using the same FD multiple times in the
batch gives undefined behavior.)

>> 1. Initialize the cmd.errors fields to zero.
>> 2. Call epoll_ctl_mod()
>> 3. Iterate through cmd.errors looking for the first nonzero 
>>    field.
> 
> It's close, but zero is not good enough, if copy_from_user of cmds failed in
> the first place. Impossible value, or error value, will be safer.

See my comment in the earlier mail. If you split this into two 
APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
then the return value of epoll_ctl_batch() could be used to tell
user space how many commands succeeded. Much simpler!

>>> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
>>> scalar. This provides higher precision. 
>>
>> Yes, that change seemed inevitable. It slightly puzzled me at the time when
>> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
>> though pselect() already had demonstrated the need for higher precision.
>> I should have called it out way back then :-{.
>>
>>> The parameter field in struct
>>> epoll_wait_spec, "clockid", also makes it possible for users to use a different
>>> clock than the default when it makes more sense.
>>>
>>> Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/syscalls.h |  5 ++++
>>>  2 files changed, 65 insertions(+)
>>>
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index e7a116d..2cc22c9 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>>>  			      sigmask ? &ksigmask : NULL);
>>>  }
>>>  
>>> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
>>> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
>>> +		struct epoll_wait_spec __user *, spec)
>>> +{
>>> +	struct epoll_mod_cmd *kcmds = NULL;
>>> +	int i, ret = 0;
>>> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
>>> +
>>> +	if (flags)
>>> +		return -EINVAL;
>>> +	if (ncmds) {
>>> +		if (!cmds)
>>> +			return -EINVAL;
>>> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
>>> +		if (!kcmds)
>>> +			return -ENOMEM;
>>> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
>>> +			ret = -EFAULT;
>>> +			goto out;
>>> +		}
>>> +	}
>>> +	for (i = 0; i < ncmds; i++) {
>>> +		struct epoll_event ev = (struct epoll_event) {
>>> +			.events = kcmds[i].events,
>>> +			.data = kcmds[i].data,
>>> +		};
>>> +		if (kcmds[i].flags) {
>>> +			kcmds[i].error = ret = -EINVAL;
>>
>> To make the 'ret' change a little more obvious, maybe it's better to write
>>
>> 			ret = kcmds[i].error = -EINVAL;
>>
>>> +			goto out;
>>> +		}
>>> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>>
>> Likewise:
>> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
>>
>>> +		if (ret)
>>> +			goto out;
>>> +	}
>>> +	if (spec) {
>>> +		sigset_t ksigmask;
>>> +		struct epoll_wait_spec kspec;
>>> +		ktime_t timeout;
>>> +
>>> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
>>
>> Cosmetic point: s/if(/if (/
>>
>>> +			return -EFAULT;
>>> +		if (kspec.sigmask) {
>>> +			if (kspec.sigsetsize != sizeof(sigset_t))
>>> +				return -EINVAL;
>>> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
>>> +				return -EFAULT;
>>> +		}
>>> +		timeout = timespec_to_ktime(kspec.timeout);
>>> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
>>> +				     kspec.clockid, timeout,
>>> +				     kspec.sigmask ? &ksigmask : NULL);
>>
>> If I understand correctly, the implementation means that the
>> 'size_t sigsetsize' field will probably need to be exposed to 
>> applications. In the existing epoll_pwait() call (as in  ppoll()
>> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
>> However, unless we expect glibc to do some structure copying to/from
>> a structure that hides this field, then we're going end up exposing
>> 'size_t sigsetsize' to applications. (This could be avoided, if we
>> split the API as I suggest above. glibc would do the same thing 
>> in epoll_pwait1() that it currently does in epoll_pwait().)

You missed responding to this point; I think it matters.
(There were also some other points to consider in my reply 
to your 0/6 mail.)

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21  7:56     ` Omar Sandoval
  0 siblings, 0 replies; 80+ messages in thread
From: Omar Sandoval @ 2015-01-21  7:56 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
> 
> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.
> 
> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;
> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> +			return -EFAULT;
This should probably set ret = -EFAULT and goto out, or you'll leak kcmds.

> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
Same here...

> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
and here.

> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);
> +	}
> +
> +out:
> +	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> +		return -EFAULT;
This will also leak kcmds; it should be ret = -EFAULT. It also leads to a
weird corner case: if cmds is read-only, we'll end up executing every command
but fail to copy out the return values, so when userspace gets the EFAULT, it
won't know whether anything was executed. But getting an EFAULT here means
you're probably doing something wrong anyway, so it's maybe not the biggest
concern.

> +	kfree(kcmds);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
>  		       struct epoll_event __user *, events,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 85893d7..7156c80 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -12,6 +12,8 @@
>  #define _LINUX_SYSCALLS_H
>  
>  struct epoll_event;
> +struct epoll_mod_cmd;
> +struct epoll_wait_spec;
>  struct iattr;
>  struct inode;
>  struct iocb;
> @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
>  				int maxevents, int timeout,
>  				const sigset_t __user *sigmask,
>  				size_t sigsetsize);
> +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> +				   int ncmds, struct epoll_mod_cmd __user * cmds,
> +				   struct epoll_wait_spec __user * spec);
>  asmlinkage long sys_gethostname(char __user *name, int len);
>  asmlinkage long sys_sethostname(char __user *name, int len);
>  asmlinkage long sys_setdomainname(char __user *name, int len);
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Omar

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21  7:56     ` Omar Sandoval
  0 siblings, 0 replies; 80+ messages in thread
From: Omar Sandoval @ 2015-01-21  7:56 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86-DgEjT+Ai2ygdnm+yROfE0A,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra

On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
> 
> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.
> 
> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;
> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> +			return -EFAULT;
This should probably set ret = -EFAULT and goto out, or you'll leak kcmds.

> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
Same here...

> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
and here.

> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);
> +	}
> +
> +out:
> +	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> +		return -EFAULT;
This will also leak kcmds; it should be ret = -EFAULT. This case, however, seems
to lead to a weird corner case: if cmds is read-only, we'll end up executing
every command but fail to copy out the return values, so when userspace gets the
EFAULT, it won't know whether anything was executed. But getting an EFAULT here
means you're probably doing something wrong anyway, so maybe it's not the
biggest concern.
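
The cleanup pattern suggested in the comments above can be sketched in userspace
roughly as follows. This is only an illustration, not the kernel code: malloc()
and free() stand in for kmalloc() and kfree(), memcpy() stands in for
copy_from_user()/copy_to_user(), and the names run_batch, struct cmd,
fail_wait_setup and live_allocs are hypothetical. The point is that every
failure after the allocation sets ret and jumps to "out" instead of returning
directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct epoll_mod_cmd; only the fields needed here. */
struct cmd {
	int flags;	/* must be 0, as in the patch */
	int error;	/* output: result of the command */
};

/* Counts outstanding allocations so that a leak would be visible to a caller. */
static int live_allocs;

/*
 * Every failure after the allocation sets ret and jumps to "out", so the
 * buffer is freed exactly once on all paths.
 */
static int run_batch(struct cmd *cmds, int ncmds, int fail_wait_setup)
{
	struct cmd *kcmds = NULL;
	size_t size;
	int i, ret = 0;

	assert(ncmds >= 0);	/* precondition of this sketch */
	size = sizeof(*kcmds) * (size_t)ncmds;

	if (ncmds) {
		if (!cmds)
			return -EINVAL;	/* nothing allocated yet */
		kcmds = malloc(size);
		if (!kcmds)
			return -ENOMEM;
		live_allocs++;
		memcpy(kcmds, cmds, size);
	}
	for (i = 0; i < ncmds; i++) {
		if (kcmds[i].flags) {
			ret = kcmds[i].error = -EINVAL;
			goto out;
		}
		kcmds[i].error = 0;	/* pretend the command succeeded */
	}
	if (fail_wait_setup) {
		/* Models a failed copy of the wait spec: no early return. */
		ret = -EFAULT;
		goto out;
	}
out:
	if (ncmds)
		memcpy(cmds, kcmds, size);	/* report per-command results */
	if (kcmds) {
		free(kcmds);
		live_allocs--;
	}
	return ret;
}
```

With this shape, a failure while setting up the wait still reports the
per-command results and releases the buffer.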

> +	kfree(kcmds);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
>  		       struct epoll_event __user *, events,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 85893d7..7156c80 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -12,6 +12,8 @@
>  #define _LINUX_SYSCALLS_H
>  
>  struct epoll_event;
> +struct epoll_mod_cmd;
> +struct epoll_wait_spec;
>  struct iattr;
>  struct inode;
>  struct iocb;
> @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
>  				int maxevents, int timeout,
>  				const sigset_t __user *sigmask,
>  				size_t sigsetsize);
> +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> +				   int ncmds, struct epoll_mod_cmd __user * cmds,
> +				   struct epoll_wait_spec __user * spec);
>  asmlinkage long sys_gethostname(char __user *name, int len);
>  asmlinkage long sys_sethostname(char __user *name, int len);
>  asmlinkage long sys_setdomainname(char __user *name, int len);
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Omar

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-21  7:52         ` Michael Kerrisk (man-pages)
@ 2015-01-21  8:58           ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  8:58 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	linux-fsdevel, linux-api, Josh Triplett, Paolo Bonzini

On Wed, 01/21 08:52, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 01/21/2015 05:59 AM, Fam Zheng wrote:
> > On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> >> Hello Fam Zheng,
> >>
> >> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> >>> This syscall is a sequence of
> >>>
> >>> 1) a number of epoll_ctl calls
> >>> 2) a epoll_pwait, with timeout enhancement.
> >>>
> >>> The epoll_ctl operations are embeded so that application doesn't have to use
> >>> separate syscalls to insert/delete/update the fds before poll. It is more
> >>> efficient if the set of fds varies from one poll to another, which is the
> >>> common pattern for certain applications. 
> >>
> >> Which applications? Could we have some specific examples? This is a 
> >> complex API, and it needs good justification.
> > 
> > OK, I'll explain more in v2.
> 
> Okay.
> 
> [...]
> 
> >>> The only complexity is returning the result of each operation.  For each
> >>> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> >>> the return code *iff* the command is executed (0 for success and -errno of the
> >>> equivalent epoll_ctl call), and will be left unchanged if the command is not
> >>> executed because some earlier error, for example due to failure of
> >>> copy_from_user to copy the array.
> >>>
> >>> Applications can utilize this fact to do error handling: they could initialize
> >>> all the epoll_mod_wait.error to a positive value, which is by definition not a
> >>> possible output value from epoll_mod_wait. Then when the syscall returned, they
> >>> know whether or not the command is executed by comparing each error with the
> >>> init value, if they're different, they have the result of the command.
> >>> More roughly, they can put any non-zero and not distinguish "not run" from
> >>> failure.
> >>
> >> The "cmds' are not executed in a specified order plus the need to
> >> initialize the 'errors' fields to a positive value feels a bit ugly.
> >> And indeed the whole "command list was only partially run" case
> >> is not pretty. Am I correct to understand that if an error is found
> >> during execution of one of the "epoll_ctl" commands in 'cmds' then
> >> the system call will return -1 with errno set, indicating an error,
> >> even though the epoll interest list may have changed because some
> >> of the earlier 'cmds' executed successfully? This all seems a bit of
> >> a headache for user space.
> > 
> > This is the trade-off for batching. The best we can do is probably make this
> > transactional: none or all of the commands succeeds. It will require a much
> > more complex implementation, though. 
> 
> Transactional would be more comfortable for user-space, and while I 
> can see that it would be complex to implement, perhaps the greater
> point might be that the implementation is CPU expensive.

Good point.

> 
> > But even with that, the error reporting on
> > which command failed is a complication.
> 
> My suggestions below could make the error reporting much simpler...
> 
> >> I have a couple of questions:
> >>
> >> Q1. I can see that batching "epoll_ctl" commands might be useful,
> >> since it results in fewer systems calls. But, does it really
> >> need to be bound together with the "epoll_pwait" functionality?
> >> (Perhaps this point was covered in previous discussions, but
> >> neither the message accompanying this patch nor the 0/6 man page
> >> provide a compelling rationale for the need to bind these two
> >> operations together.)
> >>
> >> Yes, I realize you might save a system call, but it makes for a
> >> cumbersome API that has the above headache, and also forces the 
> >> need for double pointer indirection in the 'spec' argument (i.e., 
> >> spec is a pointer to an array of structures where each element
> >> in turn includes an 'events' pointer that points to another array).
> >>
> >> Why not a simpler API with two syscalls such as:
> >>
> >> epoll_ctl_batch(int epfd, int flags,
> >>                 int ncmds, struct epoll_mod_cmd *cmds);
> >>
> >> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
> >>              struct timespec *timeout, int clock_id, 
> >>              const sigset_t *sigmask, size_t sigsetsize);
> > 
> > The problem is that there is no room for flags field in epoll_pwait1, which is
> > asked for, in previous discussion thread [1].
> 
> Ahh yes, I certainly should not have forgotten that. But that's easily solved.
> Do as for pselect6():
> 
> struct sigargs {
>     const sigset_t *ss;
>     size_t          ss_len; /* Size (in bytes) of object pointed
>                                to by 'ss' */
> };
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents,
>              struct timespec *timeout, int clock_id, int flags,
>              const struct sigargs *sargs);
> 
> > I don't see epoll_mod_wait as a *significantly more* complicated interface
> > compared to epoll_ctl_batch and epoll_pwait1 above. 
> 
> My biggest problem with epoll_ctl_wait() is the complexity of error 
> handling. epoll_ctl_batch and epoll_pwait1 would simplify that, as I 
> note below.
> 
> Aside from that, I do think that epoll_ctl_wait() passes a certain
> threshold of complexity that warrants good justification, which
> I haven't really seen yet.

OK, see below.

> 
> > In epoll_mod_wait, if you
> > leave out ncmds and cmds, it is effectively a poll without batch; and if
> > leaving out spec, it is effectively a batch without poll.
> > 
> > The most important change here is the timeout. IMO I wouldn't mind leaving out
> > batching. Integrating it is requested by Andy:
> > 
> > [1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
> > 
> > which also made sense to me; I do have a patch in QEMU to call epoll_ctl for a
> > number of times right before epoll_wait.
> > 
> > [Sorry for not putting anything into cover letter changelog, but it is also
> > interesting to see people's reaction on the patch itself without bias of
> > others' opinions. This indeed brings in more points. :]
> 
> But it also has the downside that the same discussions
> may be repeated.

Yes, that's why I always do it for any version after v1.

> 
> >> This gives us much of the benefit of reducing system calls, but 
> >> with greater simplicity. And epoll_ctl_batch() could simply return
> >> the number of 'cmds' that were successfully executed.)
> >>
> >> Q2. In the man page in 0/6 you said that the 'cmds' were not 
> >> guaranteed to be executed in order. Why not? If you did provide
> >> such a guarantee, then, when using your current epoll_mod_wait(),
> >> user space could do the following:
> > 
> > I guess we can make a guarentee on that.
> 
> I'm puzzled by that response. Surely you *must* guarantee it.
> If there's no defined order, then if the batch includes multiple 
> commands that operate on the same FD, the result is undefined 
> unless you provide that guarantee. (Unless, of course, you want to
> explicitly specify that using the same FD multiple times in the
> batch gives undefined behavior.)

OK. Then let's guarantee it.

> 
> >> 1. Initialize the cmd.errors fields to zero.
> >> 2. Call epoll_ctl_mod()
> >> 3. Iterate through cmd.errors looking for the first nonzero 
> >>    field.
> > 
> > It's close, but zero is not good enough, if copy_from_user of cmds failed in
> > the first place. Impossible value, or error value, will be safer.
> 
> See my comment in the earlier mail. If you split this into two 
> APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
> then the return value of epoll_ctl_batch() could be used to tell
> user space how many commands succeeded. Much simpler!

Yes, it is much simpler. However, the reason to add batching in the first place
is to make epoll faster by reducing syscalls. Splitting makes the result
sub-optimal: we still need at least 2 calls instead of 1. Each of the three
proposed new calls *is* a step forward, but I don't think we will have
everything solved even by implementing them all. A compromise is needed between
performance and complexity.
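
For reference, the count-returning semantics discussed above can be emulated in
userspace today with a loop over epoll_ctl(2). The following is a minimal
sketch, assuming a hypothetical struct batch_cmd and a hypothetical helper
epoll_ctl_batch_emul (neither exists in the kernel or in glibc). It saves no
system calls and only illustrates how the error reporting would look to the
caller when commands run in order.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical command layout, loosely modeled on epoll_mod_cmd. */
struct batch_cmd {
	int op;			/* EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL */
	int fd;
	struct epoll_event ev;
};

/*
 * Run cmds in order and stop at the first failure. Returns the number of
 * commands that succeeded; if that is less than ncmds, errno describes why
 * cmds[<return value>] failed.
 */
static int epoll_ctl_batch_emul(int epfd, int ncmds, struct batch_cmd *cmds)
{
	int i;

	assert(ncmds >= 0);	/* precondition of this sketch */
	for (i = 0; i < ncmds; i++) {
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &cmds[i].ev) < 0)
			break;
	}
	return i;
}
```

Because execution is ordered and stops at the first error, the caller needs no
sentinel values in the command array: the return value is the index of the
failing command.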

My take for simplicity is to leave epoll_ctl as-is, and my take for performance
is epoll_pwait1. I don't really want to spend my time on epoll_ctl_batch, since
I see it as an ambivalent compromise between the two.

Andy, will you be OK with the above epoll_pwait1? Or do you have more
justifications for epoll_mod_wait?

> 
> >>> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> >>> scalar. This provides higher precision. 
> >>
> >> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> >> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> >> though pselect() already had demonstrated the need for higher precision.
> >> I should have called it out way back then :-{.
> >>
> >>> The parameter field in struct
> >>> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> >>> clock than the default when it makes more sense.
> >>>
> >>> Signed-off-by: Fam Zheng <famz@redhat.com>
> >>> ---
> >>>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  include/linux/syscalls.h |  5 ++++
> >>>  2 files changed, 65 insertions(+)
> >>>
> >>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> >>> index e7a116d..2cc22c9 100644
> >>> --- a/fs/eventpoll.c
> >>> +++ b/fs/eventpoll.c
> >>> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >>>  			      sigmask ? &ksigmask : NULL);
> >>>  }
> >>>  
> >>> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> >>> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> >>> +		struct epoll_wait_spec __user *, spec)
> >>> +{
> >>> +	struct epoll_mod_cmd *kcmds = NULL;
> >>> +	int i, ret = 0;
> >>> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> >>> +
> >>> +	if (flags)
> >>> +		return -EINVAL;
> >>> +	if (ncmds) {
> >>> +		if (!cmds)
> >>> +			return -EINVAL;
> >>> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> >>> +		if (!kcmds)
> >>> +			return -ENOMEM;
> >>> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> >>> +			ret = -EFAULT;
> >>> +			goto out;
> >>> +		}
> >>> +	}
> >>> +	for (i = 0; i < ncmds; i++) {
> >>> +		struct epoll_event ev = (struct epoll_event) {
> >>> +			.events = kcmds[i].events,
> >>> +			.data = kcmds[i].data,
> >>> +		};
> >>> +		if (kcmds[i].flags) {
> >>> +			kcmds[i].error = ret = -EINVAL;
> >>
> >> To make the 'ret' change a little more obvious, maybe it's better to write
> >>
> >> 			ret = kcmds[i].error = -EINVAL;
> >>
> >>> +			goto out;
> >>> +		}
> >>> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> >>
> >> Likewise:
> >> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> >>
> >>> +		if (ret)
> >>> +			goto out;
> >>> +	}
> >>> +	if (spec) {
> >>> +		sigset_t ksigmask;
> >>> +		struct epoll_wait_spec kspec;
> >>> +		ktime_t timeout;
> >>> +
> >>> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> >>
> >> Cosmetic point: s/if(/if (/
> >>
> >>> +			return -EFAULT;
> >>> +		if (kspec.sigmask) {
> >>> +			if (kspec.sigsetsize != sizeof(sigset_t))
> >>> +				return -EINVAL;
> >>> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> >>> +				return -EFAULT;
> >>> +		}
> >>> +		timeout = timespec_to_ktime(kspec.timeout);
> >>> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> >>> +				     kspec.clockid, timeout,
> >>> +				     kspec.sigmask ? &ksigmask : NULL);
> >>
> >> If I understand correctly, the implementation means that the
> >> 'size_t sigsetsize' field will probably need to be exposed to 
> >> applications. In the existing epoll_pwait() call (as in  ppoll()
> >> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> >> However, unless we expect glibc to do some structure copying to/from
> >> a structure that hides this field, then we're going end up exposing
> >> 'size_t sigsetsize' to applications. (This could be avoided, if we
> >> split the API as I suggest above. glibc would do the same thing 
> >> in epoll_pwait1() that it currently does in epoll_pwait().)
> 
> You missed responding to this point; I think it matters.
> (There were also some other points to consider in my reply 
> to your 0/6 mail.)
> 

My bad, something I typed when replying went into /dev/null for unknown reasons.
What I had was:

This should be easy to solve: glibc can be responsible for building spec, and
applications will do something like:

	epoll_mod_wait(epfd, ncmds, cmds, maxevents, events, clockid,
		       timeout, sigmask);

Thanks,
Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21  8:58           ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  8:58 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	linux-fsdevel

On Wed, 01/21 08:52, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 01/21/2015 05:59 AM, Fam Zheng wrote:
> > On Tue, 01/20 13:50, Michael Kerrisk (man-pages) wrote:
> >> Hello Fam Zheng,
> >>
> >> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> >>> This syscall is a sequence of
> >>>
> >>> 1) a number of epoll_ctl calls
> >>> 2) a epoll_pwait, with timeout enhancement.
> >>>
> >>> The epoll_ctl operations are embeded so that application doesn't have to use
> >>> separate syscalls to insert/delete/update the fds before poll. It is more
> >>> efficient if the set of fds varies from one poll to another, which is the
> >>> common pattern for certain applications. 
> >>
> >> Which applications? Could we have some specific examples? This is a 
> >> complex API, and it needs good justification.
> > 
> > OK, I'll explain more in v2.
> 
> Okay.
> 
> [...]
> 
> >>> The only complexity is returning the result of each operation.  For each
> >>> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> >>> the return code *iff* the command is executed (0 for success and -errno of the
> >>> equivalent epoll_ctl call), and will be left unchanged if the command is not
> >>> executed because some earlier error, for example due to failure of
> >>> copy_from_user to copy the array.
> >>>
> >>> Applications can utilize this fact to do error handling: they could initialize
> >>> all the epoll_mod_wait.error to a positive value, which is by definition not a
> >>> possible output value from epoll_mod_wait. Then when the syscall returned, they
> >>> know whether or not the command is executed by comparing each error with the
> >>> init value, if they're different, they have the result of the command.
> >>> More roughly, they can put any non-zero and not distinguish "not run" from
> >>> failure.
> >>
> >> The "cmds' are not executed in a specified order plus the need to
> >> initialize the 'errors' fields to a positive value feels a bit ugly.
> >> And indeed the whole "command list was only partially run" case
> >> is not pretty. Am I correct to understand that if an error is found
> >> during execution of one of the "epoll_ctl" commands in 'cmds' then
> >> the system call will return -1 with errno set, indicating an error,
> >> even though the epoll interest list may have changed because some
> >> of the earlier 'cmds' executed successfully? This all seems a bit of
> >> a headache for user space.
> > 
> > This is the trade-off for batching. The best we can do is probably make this
> > transactional: none or all of the commands succeeds. It will require a much
> > more complex implementation, though. 
> 
> Transactional would be more comfortable for user-space, and while I 
> can see that it would be complex to implement, perhaps the greater
> point might be that the implementation is CPU expensive.

Good point.

> 
> > But even with that, the error reporting on
> > which command failed is a complication.
> 
> My suggestions below could make the error reporting much simpler...
> 
> >> I have a couple of questions:
> >>
> >> Q1. I can see that batching "epoll_ctl" commands might be useful,
> >> since it results in fewer systems calls. But, does it really
> >> need to be bound together with the "epoll_pwait" functionality?
> >> (Perhaps this point was covered in previous discussions, but
> >> neither the message accompanying this patch nor the 0/6 man page
> >> provide a compelling rationale for the need to bind these two
> >> operations together.)
> >>
> >> Yes, I realize you might save a system call, but it makes for a
> >> cumbersome API that has the above headache, and also forces the 
> >> need for double pointer indirection in the 'spec' argument (i.e., 
> >> spec is a pointer to an array of structures where each element
> >> in turn includes an 'events' pointer that points to another array).
> >>
> >> Why not a simpler API with two syscalls such as:
> >>
> >> epoll_ctl_batch(int epfd, int flags,
> >>                 int ncmds, struct epoll_mod_cmd *cmds);
> >>
> >> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
> >>              struct timespec *timeout, int clock_id, 
> >>              const sigset_t *sigmask, size_t sigsetsize);
> > 
> > The problem is that there is no room for flags field in epoll_pwait1, which is
> > asked for, in previous discussion thread [1].
> 
> Ahh yes, I certainly should not have forgotten that. But that's easily solved.
> Do as for pselect6():
> 
> struct sigargs {
>     const sigset_t *ss;
>     size_t          ss_len; /* Size (in bytes) of object pointed
>                                to by 'ss' */
> };
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents,
>              struct timespec *timeout, int clock_id, int flags,
>              const struct sigargs *sargs);
> 
> > I don't see epoll_mod_wait as a *significantly more* complicated interface
> > compared to epoll_ctl_batch and epoll_pwait1 above. 
> 
> My biggest problem with epoll_mod_wait() is the complexity of error 
> handling. epoll_ctl_batch and epoll_pwait1 would simplify that, as I 
> note below.
> 
> Aside from that, I do think that epoll_mod_wait() passes a certain
> threshold of complexity that warrants good justification, which
> I haven't really seen yet.

OK, see below.

> 
> > In epoll_mod_wait, if you
> > leave out ncmds and cmds, it is effectively a poll without batching; and if
> > you leave out spec, it is effectively a batch without a poll.
> > 
> > The most important change here is the timeout. IMO I wouldn't mind leaving out
> > batching. Integrating it was requested by Andy:
> > 
> > [1]: http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591
> > 
> > which also made sense to me; I do have a patch in QEMU that calls epoll_ctl a
> > number of times right before epoll_wait.
> > 
> > [Sorry for not putting anything into the cover letter changelog, but it is also
> > interesting to see people's reactions to the patch itself without the bias of
> > others' opinions. This indeed brings in more points. :]
> 
> But it also has the downside that the same discussions
> may be repeated.

Yes, that's why I always include one for any version after v1.

> 
> >> This gives us much of the benefit of reducing system calls, but 
> >> with greater simplicity. And epoll_ctl_batch() could simply return
> >> the number of 'cmds' that were successfully executed.)
> >>
> >> Q2. In the man page in 0/6 you said that the 'cmds' were not 
> >> guaranteed to be executed in order. Why not? If you did provide
> >> such a guarantee, then, when using your current epoll_mod_wait(),
> >> user space could do the following:
> > 
> > I guess we can make a guarantee on that.
> 
> I'm puzzled by that response. Surely you *must* guarantee it.
> If there's no defined order, then if the batch includes multiple 
> commands that operate on the same FD, the result is undefined 
> unless you provide that guarantee. (Unless, of course, you want to
> explicitly specify that using the same FD multiple times in the
> batch gives undefined behavior.)

OK. Then let's guarantee it.

> 
> >> 1. Initialize the cmd.errors fields to zero.
> >> 2. Call epoll_mod_wait()
> >> 3. Iterate through cmd.errors looking for the first nonzero 
> >>    field.
> > 
> > It's close, but zero is not good enough if copy_from_user of cmds failed in
> > the first place. An impossible value, or an error value, would be safer.
> 
> See my comment in the earlier mail. If you split this into two 
> APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
> then the return value of epoll_ctl_batch() could be used to tell
> user space how many commands succeeded. Much simpler!
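
The return-value convention suggested for epoll_ctl_batch() can be
approximated in user space with today's epoll_ctl(). The sketch below is only
an emulation under assumptions: struct batch_cmd and batch_apply() are
hypothetical names, and it still costs one system call per command, so it
illustrates the error reporting, not the performance gain.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Local stand-in for the command structure discussed in the thread. */
struct batch_cmd {
	int      op;		/* EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL */
	int      fd;
	uint32_t events;
	uint64_t data;
};

/* Hypothetical emulation: apply the commands strictly in order and stop
 * at the first failure. Returns the number of commands that succeeded.
 * If that is less than ncmds, errno describes why the command at that
 * index failed. */
static int batch_apply(int epfd, const struct batch_cmd *cmds, int ncmds)
{
	int i;

	assert(cmds != NULL || ncmds == 0);

	for (i = 0; i < ncmds; i++) {
		struct epoll_event ev = {
			.events = cmds[i].events,
			.data.u64 = cmds[i].data,
		};

		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) < 0)
			break;
	}
	return i;
}
```

Because execution is ordered, a short count identifies the failing command
without any per-command error field.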

Yes, it is much simpler. However, the reason to add batching in the first place is
to make epoll faster by reducing syscalls. Splitting makes the result
sub-optimal: we still need at least 2 calls instead of 1. Each of the three
proposed new calls *is* a step forward, but I don't think we will have everything
solved even by implementing them all. A compromise is needed between performance
and complexity.

For simplicity, my preference is to leave epoll_ctl as-is; for performance, my
preference is epoll_pwait1. I would rather not spend my time on
epoll_ctl_batch, which I see as an ambivalent compromise in between.

Andy, will you be OK with the above epoll_pwait1? Or do you have more
justifications for epoll_mod_wait?

> 
> >>> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> >>> scalar. This provides higher precision. 
> >>
> >> Yes, that change seemed inevitable. It slightly puzzled me at the time when
> >> Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
> >> though pselect() already had demonstrated the need for higher precision.
> >> I should have called it out way back then :-{.
> >>
> >>> The parameter field in struct
> >>> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> >>> clock than the default when it makes more sense.
> >>>
> >>> Signed-off-by: Fam Zheng <famz@redhat.com>
> >>> ---
> >>>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  include/linux/syscalls.h |  5 ++++
> >>>  2 files changed, 65 insertions(+)
> >>>
> >>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> >>> index e7a116d..2cc22c9 100644
> >>> --- a/fs/eventpoll.c
> >>> +++ b/fs/eventpoll.c
> >>> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >>>  			      sigmask ? &ksigmask : NULL);
> >>>  }
> >>>  
> >>> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> >>> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> >>> +		struct epoll_wait_spec __user *, spec)
> >>> +{
> >>> +	struct epoll_mod_cmd *kcmds = NULL;
> >>> +	int i, ret = 0;
> >>> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> >>> +
> >>> +	if (flags)
> >>> +		return -EINVAL;
> >>> +	if (ncmds) {
> >>> +		if (!cmds)
> >>> +			return -EINVAL;
> >>> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> >>> +		if (!kcmds)
> >>> +			return -ENOMEM;
> >>> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> >>> +			ret = -EFAULT;
> >>> +			goto out;
> >>> +		}
> >>> +	}
> >>> +	for (i = 0; i < ncmds; i++) {
> >>> +		struct epoll_event ev = (struct epoll_event) {
> >>> +			.events = kcmds[i].events,
> >>> +			.data = kcmds[i].data,
> >>> +		};
> >>> +		if (kcmds[i].flags) {
> >>> +			kcmds[i].error = ret = -EINVAL;
> >>
> >> To make the 'ret' change a little more obvious, maybe it's better to write
> >>
> >> 			ret = kcmds[i].error = -EINVAL;
> >>
> >>> +			goto out;
> >>> +		}
> >>> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> >>
> >> Likewise:
> >> 		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> >>
> >>> +		if (ret)
> >>> +			goto out;
> >>> +	}
> >>> +	if (spec) {
> >>> +		sigset_t ksigmask;
> >>> +		struct epoll_wait_spec kspec;
> >>> +		ktime_t timeout;
> >>> +
> >>> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> >>
> >> Cosmetic point: s/if(/if (/
> >>
> >>> +			return -EFAULT;
> >>> +		if (kspec.sigmask) {
> >>> +			if (kspec.sigsetsize != sizeof(sigset_t))
> >>> +				return -EINVAL;
> >>> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> >>> +				return -EFAULT;
> >>> +		}
> >>> +		timeout = timespec_to_ktime(kspec.timeout);
> >>> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> >>> +				     kspec.clockid, timeout,
> >>> +				     kspec.sigmask ? &ksigmask : NULL);
> >>
> >> If I understand correctly, the implementation means that the
> >> 'size_t sigsetsize' field will probably need to be exposed to 
> >> applications. In the existing epoll_pwait() call (as in  ppoll()
> >> and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
> >> However, unless we expect glibc to do some structure copying to/from
> >> a structure that hides this field, then we're going to end up exposing
> >> 'size_t sigsetsize' to applications. (This could be avoided, if we
> >> split the API as I suggest above. glibc would do the same thing 
> >> in epoll_pwait1() that it currently does in epoll_pwait().)
> 
> You missed responding to this point; I think it matters.
> (There were also some other points to consider in my reply 
> to your 0/6 mail.)
> 

My bad, something I typed when replying went into /dev/null for unknown reasons.
What I had was:

This should be easy to solve: glibc can be responsible for building spec, and
applications will do something like:

	epoll_mod_wait(epfd, ncmds, cmds, maxevents, events, clockid,
		       timeout, sigmask);

Thanks,
Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-21  7:56     ` Omar Sandoval
  (?)
@ 2015-01-21  8:59       ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  8:59 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, 01/20 23:56, Omar Sandoval wrote:
> On Tue, Jan 20, 2015 at 05:57:57PM +0800, Fam Zheng wrote:
> > This syscall is a sequence of
> > 
> > 1) a number of epoll_ctl calls
> > 2) a epoll_pwait, with timeout enhancement.
> > 
> > The epoll_ctl operations are embedded so that the application doesn't have to use
> > separate syscalls to insert/delete/update the fds before polling. It is more
> > efficient if the set of fds varies from one poll to another, which is a
> > common pattern for certain applications. For example, depending on the input
> > buffer status, a data reading program may decide to temporarily stop polling an
> > fd.
> > 
> > Because this interface enables batching, even a regular
> > epoll_ctl call sequence that manipulates several fds can be optimized into one
> > single epoll_mod_wait (while specifying spec=NULL to skip the poll part).
> > 
> > The only complexity is returning the result of each operation.  For each
> > epoll_mod_cmd in cmds, the field "error" is an output field that stores
> > the return code *iff* the command is executed (0 for success, or the -errno of the
> > equivalent epoll_ctl call), and is left unchanged if the command is not
> > executed because of some earlier error, for example a failure of
> > copy_from_user to copy the array.
> > 
> > Applications can utilize this fact to do error handling: they could initialize
> > every epoll_mod_cmd.error to a positive value, which is by definition not a
> > possible output value from epoll_mod_wait. Then, when the syscall returns, they
> > know whether or not each command was executed by comparing its error with the
> > init value; if they differ, they have the result of the command.
> > More roughly, they can put in any non-zero value and not distinguish "not run" from
> > failure.
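
The sentinel scheme described above could look roughly like this on the
application side. This is a sketch under assumptions: struct mod_cmd is a
local stand-in for the proposed epoll_mod_cmd (the member types are guesses),
and CMD_NOT_RUN, mark_not_run() and first_failure() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Any positive value works: the kernel side stores only 0 or -errno. */
#define CMD_NOT_RUN 1

/* Local stand-in for the proposed epoll_mod_cmd; member types assumed. */
struct mod_cmd {
	int      flags;
	int      op;
	int      fd;
	uint32_t events;
	uint64_t data;
	int      error;		/* output: 0, -errno, or left untouched */
};

/* Before the call: mark every command as "not run". */
static void mark_not_run(struct mod_cmd *cmds, int ncmds)
{
	for (int i = 0; i < ncmds; i++)
		cmds[i].error = CMD_NOT_RUN;
}

/* After the call: count the commands that were executed and return the
 * index of the first one that failed, or -1 if none failed. */
static int first_failure(const struct mod_cmd *cmds, int ncmds, int *executed)
{
	int failed = -1;

	*executed = 0;
	for (int i = 0; i < ncmds; i++) {
		if (cmds[i].error == CMD_NOT_RUN)
			continue;	/* sentinel untouched: never executed */
		(*executed)++;
		if (cmds[i].error < 0 && failed < 0)
			failed = i;
	}
	return failed;
}
```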
> > 
> > Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> > scalar. This provides higher precision. The parameter field in struct
> > epoll_wait_spec, "clockid", also makes it possible for users to use a different
> > clock than the default when it makes more sense.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/syscalls.h |  5 ++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e7a116d..2cc22c9 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
> >  			      sigmask ? &ksigmask : NULL);
> >  }
> >  
> > +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> > +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> > +		struct epoll_wait_spec __user *, spec)
> > +{
> > +	struct epoll_mod_cmd *kcmds = NULL;
> > +	int i, ret = 0;
> > +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +	if (ncmds) {
> > +		if (!cmds)
> > +			return -EINVAL;
> > +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > +		if (!kcmds)
> > +			return -ENOMEM;
> > +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +	}
> > +	for (i = 0; i < ncmds; i++) {
> > +		struct epoll_event ev = (struct epoll_event) {
> > +			.events = kcmds[i].events,
> > +			.data = kcmds[i].data,
> > +		};
> > +		if (kcmds[i].flags) {
> > +			kcmds[i].error = ret = -EINVAL;
> > +			goto out;
> > +		}
> > +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	if (spec) {
> > +		sigset_t ksigmask;
> > +		struct epoll_wait_spec kspec;
> > +		ktime_t timeout;
> > +
> > +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))
> > +			return -EFAULT;
> This should probably be goto out, or you'll leak kcmds.
> 
> > +		if (kspec.sigmask) {
> > +			if (kspec.sigsetsize != sizeof(sigset_t))
> > +				return -EINVAL;
> Same here...
> 
> > +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> > +				return -EFAULT;
> and here.
> 
> > +		}
> > +		timeout = timespec_to_ktime(kspec.timeout);
> > +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> > +				     kspec.clockid, timeout,
> > +				     kspec.sigmask ? &ksigmask : NULL);
> > +	}
> > +
> > +out:
> > +	if (ncmds && copy_to_user(cmds, kcmds, cmd_size))
> > +		return -EFAULT;
> This will also leak kcmds, it should be ret = -EFAULT. This case, however, seems
> to lead to a weird corner case: if cmds is read-only, we'll end up executing
> every command but fail to copy out the return values, so when userspace gets the
> EFAULT, it won't know whether anything was executed. But, getting an EFAULT here
> means you're probably doing something wrong anyways, so maybe not the biggest
> concern.

Yes, thanks! Will fix this.
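
The fix amounts to the usual single-exit cleanup pattern. The following
user-space analogue shows only the control flow; it is not the revised kernel
patch. buf_alloc()/buf_free() stand in for kmalloc()/kfree(), and
mod_wait_flow() with its fail_stage knob is a hypothetical helper for
injecting failures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_buffers;	/* buffers allocated but not yet freed */

static void *buf_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_buffers++;
	return p;
}

static void buf_free(void *p)
{
	if (p) {
		live_buffers--;
		free(p);
	}
}

/* fail_stage: 0 = no failure, 1 = failure while reading the wait spec,
 * 2 = failure while copying the results back to the caller. */
static int mod_wait_flow(size_t ncmds, int fail_stage)
{
	int *kcmds = NULL;
	int ret = 0;

	if (ncmds) {
		kcmds = buf_alloc(ncmds * sizeof(*kcmds));
		if (!kcmds)
			return -ENOMEM;	/* nothing to free yet */
		memset(kcmds, 0, ncmds * sizeof(*kcmds));
	}

	if (fail_stage == 1) {
		ret = -EFAULT;	/* was "return -EFAULT" in the posted patch */
		goto out;
	}

	/* ... the wait itself would happen here ... */

out:
	if (ncmds && fail_stage == 2)
		ret = -EFAULT;	/* record the error, then still free below */
	buf_free(kcmds);
	return ret;
}
```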

Fam

> 
> > +	kfree(kcmds);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_COMPAT
> >  COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
> >  		       struct epoll_event __user *, events,
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 85893d7..7156c80 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -12,6 +12,8 @@
> >  #define _LINUX_SYSCALLS_H
> >  
> >  struct epoll_event;
> > +struct epoll_mod_cmd;
> > +struct epoll_wait_spec;
> >  struct iattr;
> >  struct inode;
> >  struct iocb;
> > @@ -630,6 +632,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
> >  				int maxevents, int timeout,
> >  				const sigset_t __user *sigmask,
> >  				size_t sigsetsize);
> > +asmlinkage long sys_epoll_mod_wait(int epfd, int flags,
> > +				   int ncmds, struct epoll_mod_cmd __user * cmds,
> > +				   struct epoll_wait_spec __user * spec);
> >  asmlinkage long sys_gethostname(char __user *name, int len);
> >  asmlinkage long sys_sethostname(char __user *name, int len);
> >  asmlinkage long sys_setdomainname(char __user *name, int len);
> > -- 
> > 1.9.3
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> Omar

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-21  9:05     ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  9:05 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini

On Tue, 01/20 13:48, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> I know this API has been through a number of iterations, and there were 
> discussions about the design that led to it becoming more complex.
> But, let us assume that someone has not seen those discussions,
> or forgotten them, or is too lazy to go hunting list archives.
> 
> Then: this patch series should somewhere have an explanation of
> why the API is what it is, ideally with links to previous relevant
> discussions. I see that you do part of that in
> 
>     [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
> 
> There are however no links to previous discussions in that mail (I guess
> http://thread.gmane.org/gmane.linux.kernel/1861430/focus=91591 is most
> relevant), nor is there any sort of change log in the commit message 
> that explains the evolution of the API. Having those would ease the 
> task of reviewers.
> 
> Coming back to THIS mail, this man page should also include an
> explanation of why the API is what it is. That would include much
> of the detail from the 5/6 patch, and probably more info besides.
> 
> Some specific points below.
> 
> On 01/20/2015 10:57 AM, Fam Zheng wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> > 
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> > 
> > SYNOPSIS
> > 
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> > 
> > DESCRIPTION
> > 
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >        
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >
> >        2) When you need higher precision than microsecond for wait timeout.
> 
> s/microsecond/millisecond/

Yes, thanks for pointing that out.

> >        if all the commands are successfully executed (all the error fields are
> >        set to 0), events are polled.
> 
> Does the operation execute all commands, or stop when it encounters the first 
> error? In other words, when looping over the returned 'error' fields, what
> is the termination condition for the user-space application?
> 
> (Yes, I know I can trivially inspect the patch 5/6 to answer this question, 
> but the man page should explicitly state this so that I don't have to 
> read the source, and also because it is only if you explicitly document 
> the intended behavior that I can tell whether the actual implementation 
> matches the intention.)


> 
> >        The last parameter "spec" is a pointer to struct epoll_wait_spec, which
> >        contains the information about how to poll the events. If it's NULL, this
> >        call will immediately return after running all the commands in cmds.
> > 
> >        The structure is defined as below:
> > 
> >            struct epoll_wait_spec {
> > 
> >                   /* The same as "maxevents" in epoll_pwait() */
> >                   int maxevents;
> > 
> >                   /* The same as "events" in epoll_pwait() */
> >                   struct epoll_event *events;
> > 
> >                   /* Which clock to use for timeout */
> >                   int clockid;
> 
> Which clocks can be specified here?
> CLOCK_MONOTONIC?
> CLOCK_REALTIME?
> CLOCK_PROCESS_CPUTIME_ID?
> clock_getcpuclockid()?
> others?

At the moment we can limit it to CLOCK_MONOTONIC and CLOCK_REALTIME; I'm not
sure any application cares about the others. This is not checked in this
series, but should be done in v2.
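
A minimal sketch of what such a check could look like, assuming only the two
clocks named above are accepted. The helper name epoll_check_clockid is
hypothetical and is not part of this series:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/*
 * Hypothetical helper (not in the posted patches): accept only the two
 * clocks discussed above and reject everything else with -EINVAL.
 */
int epoll_check_clockid(int clockid)
{
	switch (clockid) {
	case CLOCK_MONOTONIC:
	case CLOCK_REALTIME:
		return 0;
	default:
		return -EINVAL;
	}
}
```

Rejecting unknown clocks up front keeps the door open for adding more clock
ids later without changing the meaning of existing calls.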

> 
> >                   /* Maximum time to wait if there is no event */
> >                   struct timespec timeout;
> 
> Is this timeout relative or absolute?

Relative. I'll document it. An absolute timeout can be added later with new flags.
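
To illustrate the precision point, here is a sketch of what a caller has to do
today when it holds a relative struct timespec but can only call
epoll_wait(2), whose timeout is an int in milliseconds. The helper name
timespec_to_epoll_ms is an assumption made for this example; rounding up is
chosen so that the wait never ends earlier than requested:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <time.h>

/*
 * Hypothetical helper: convert a relative, non-negative timespec into the
 * millisecond timeout taken by epoll_wait(2). Sub-millisecond parts are
 * rounded up, and the result is clamped to INT_MAX.
 */
int timespec_to_epoll_ms(const struct timespec *ts)
{
	long long ms;

	assert(ts->tv_sec >= 0);
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);

	/* Avoid overflow for very large second counts. */
	if (ts->tv_sec > INT_MAX / 1000)
		return INT_MAX;

	ms = (long long)ts->tv_sec * 1000 + (ts->tv_nsec + 999999) / 1000000;
	return ms > INT_MAX ? INT_MAX : (int)ms;
}
```

A 500 microsecond request becomes a full millisecond here, which is the loss
that a timespec-based timeout avoids.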

> 
> >                   /* The same as "sigmask" in epoll_pwait() */
> >                   sigset_t *sigmask;
> 
> I just want to confirm here that 'sigmask' can be NULL, meaning
> that we degenerate to epoll_wait() functionality, right?

Yes. Will document explicitly.

> 
> >                   /* The same as "sigsetsize" in epoll_pwait() */
> >                   size_t sigsetsize;
> >            } EPOLL_PACKED;
> 
> What is the "EPOLL_PACKED" here for?

Copy paste error. :)

> 
> > RETURN VALUE
> > 
> >        When any error occurs, epoll_mod_wait() returns -1 and errno is set
> >        appropriately. All the "error" fields in cmds are unchanged before they
> >        are executed, and if any cmds are executed, the "error" fields are set
> >        to a return code accordingly. See also epoll_ctl for more details of the
> >        return code.
> > 
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> 
> s/milliseconds//

OK.

> 
> > 
> >        If spec is NULL, it returns 0 if all the commands are successful, and -1
> >        if an error occured.
> 
> s/occured/occurred/

OK, thanks.

> 
> > ERRORS
> > 
> >        These errors apply on either the return value of epoll_mod_wait or error
> >        status for each command, respectively.
> >
> >        EBADF  epfd or fd is not a valid file descriptor.
> > 
> >        EFAULT The memory area pointed to by events is not accessible with write
> >               permissions.
> > 
> >        EINTR  The call was interrupted by a signal handler before either any of
> >               the requested events occurred or the timeout expired; see
> >               signal(7).
> > 
> >        EINVAL epfd is not an epoll file descriptor, or maxevents is less than
> >               or equal to zero, or fd is the same as epfd, or the requested
> >               operation op is not supported by this interface.
> 
> Add: Or 'flags' is nonzero. Or a 'cmds.flags' field is nonzero.

Yes.

> 
> >        EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
> >               already registered with this epoll instance.
> > 
> >        ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
> >               with this epoll instance.
> > 
> >        ENOMEM There was insufficient memory to handle the requested op control
> >               operation.
> > 
> >        ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
> >               encountered while trying to register (EPOLL_CTL_ADD) a new file
> >               descriptor on an epoll instance.  See epoll(7) for further
> >               details.
> > 
> >        EPERM  The target file fd does not support epoll.
> > 
> > CONFORMING TO
> > 
> >        epoll_mod_wait() is Linux-specific.
> > 
> > SEE ALSO
> > 
> >        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
> 
> Please add sigprocmask(2).

OK! Thanks for reviewing this.

Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-21  9:07     ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  9:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett,
	Michael Kerrisk (man-pages),
	Paolo Bonzini

On Tue, 01/20 14:40, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz@redhat.com> wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> >
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> >
> > SYNOPSIS
> >
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> >
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >
> >        1) When epoll_ctl(2) are followed by epoll_wait(2), using epoll_mod_wait
> >        will save context switches between user mode and kernel mode;
> >
> >        2) When you need higher precision than microsecond for wait timeout.
> >
> >        The epoll_ctl(2) operations are embedded into this call by with ncmds
> >        and cmds. The latter is an array of command structs:
> >
> >            struct epoll_mod_cmd {
> >
> >                   /* Reserved flags for future extension, must be 0 for now. */
> >                   int flags;
> >
> >                   /* The same as epoll_ctl() op parameter. */
> >                   int op;
> >
> >                   /* The same as epoll_ctl() fd parameter. */
> >                   int fd;
> >
> >                   /* The same as the "events" field in struct epoll_event. */
> >                   uint32_t events;
> >
> >                   /* The same as the "data" field in struct epoll_event. */
> >                   uint64_t data;
> >
> >                   /* Output field, will be set to the return code once this
> >                    * command is executed by kernel */
> >                   int error;
> >            };
> 
> I would add an extra u32 at the end so that the structure size will be
> a multiple of 8 bytes on all platforms.

OK, makes sense.
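
A sketch of the padded layout, assuming the extra field is simply appended.
The field name "reserved" is hypothetical. Without it, the struct is 28 bytes
on i386 (where uint64_t is 4-byte aligned inside structs) but 32 bytes on
x86-64 because of implicit tail padding; with it, both come out at 32:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct epoll_mod_cmd {
	int      flags;    /* reserved, must be 0 */
	int      op;       /* same as epoll_ctl() op */
	int      fd;       /* same as epoll_ctl() fd */
	uint32_t events;   /* same as epoll_event.events */
	uint64_t data;     /* same as epoll_event.data */
	int      error;    /* output: per-command return code */
	uint32_t reserved; /* hypothetical name for the extra u32 */
};

/* The layout has no implicit padding, so it is the same on 32- and 64-bit. */
static_assert(offsetof(struct epoll_mod_cmd, data) == 16,
	      "data must be 8-byte aligned without implicit padding");
static_assert(sizeof(struct epoll_mod_cmd) == 32,
	      "struct size must be a multiple of 8 bytes");
```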

> 
> >
> >        There is no guarantee that all the commands are executed in order. Only
> >        if all the commands are successfully executed (all the error fields are
> >        set to 0), events are polled.
> 
> If this doesn't happen, what error is returned?

The last error encountered while executing the commands.

> 
> >            struct epoll_wait_spec {
> >
> >                   /* The same as "maxevents" in epoll_pwait() */
> >                   int maxevents;
> >
> >                   /* The same as "events" in epoll_pwait() */
> >                   struct epoll_event *events;
> >
> >                   /* Which clock to use for timeout */
> >                   int clockid;
> >
> >                   /* Maximum time to wait if there is no event */
> >                   struct timespec timeout;
> >
> >                   /* The same as "sigmask" in epoll_pwait() */
> >                   sigset_t *sigmask;
> >
> >                   /* The same as "sigsetsize" in epoll_pwait() */
> >                   size_t sigsetsize;
> >            } EPOLL_PACKED;
> 
> I think the convention is to align the structure's fields manually
> rather than declaring it to be packed.

OK.

> 
> >
> > RETURN VALUE
> >
> >        When any error occurs, epoll_mod_wait() returns -1 and errno is set
> >        appropriately. All the "error" fields in cmds are unchanged before they
> >        are executed, and if any cmds are executed, the "error" fields are set
> >        to a return code accordingly. See also epoll_ctl for more details of the
> >        return code.
> 
> Does this mean that callers should initialize the error fields to an
> impossible value first so they can tell which commands were executed?

Yes.
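
To make that concrete, here is a userspace sketch of the caller-side pattern,
emulated with plain epoll_ctl(2) since epoll_mod_wait() itself is only
proposed here. Several things are assumptions for illustration: the sentinel
CMD_NOT_RUN, the helper name run_cmds, stopping at the first failing command,
and storing a positive errno value in the error field. None of these are
fixed by the man page text above:

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical sentinel: a value no errno (positive or negated) can take. */
#define CMD_NOT_RUN INT_MIN

struct epoll_mod_cmd {
	int      flags;
	int      op;
	int      fd;
	uint32_t events;
	uint64_t data;
	int      error;
};

/*
 * Emulate the command phase in userspace. Every error field is first set to
 * the sentinel; commands then run in order and, in this sketch, execution
 * stops at the first failure. Returns 0 on success, or -1 with errno set to
 * the error of the failing command.
 */
int run_cmds(int epfd, int ncmds, struct epoll_mod_cmd *cmds)
{
	int i;

	for (i = 0; i < ncmds; i++)
		cmds[i].error = CMD_NOT_RUN;

	for (i = 0; i < ncmds; i++) {
		struct epoll_event ev = { .events = cmds[i].events };

		ev.data.u64 = cmds[i].data;
		if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) < 0) {
			cmds[i].error = errno;
			return -1;
		}
		cmds[i].error = 0;
	}
	return 0;
}
```

After a failed call, the caller walks the array: entries with error == 0
succeeded, the entry with a real error code failed, and entries still holding
CMD_NOT_RUN were never attempted.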

> 
> >
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout milliseconds.
> >
> >        If spec is NULL, it returns 0 if all the commands are successful, and -1
> >        if an error occured.
> >
> > ERRORS
> >
> >        These errors apply on either the return value of epoll_mod_wait or error
> >        status for each command, respectively.
> 
> Please clarify which errors are returned overall and which are per-command.

OK.

Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait"
@ 2015-01-21  9:07     ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21  9:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro,
	Andrew Morton, Kees Cook, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra

On Tue, 01/20 14:40, Andy Lutomirski wrote:
> On Tue, Jan 20, 2015 at 1:57 AM, Fam Zheng <famz-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > This adds a new system call, epoll_mod_wait. It's described as below:
> >
> > NAME
> >        epoll_mod_wait - modify and wait for I/O events on an epoll file
> >                         descriptor
> >
> > SYNOPSIS
> >
> >        int epoll_mod_wait(int epfd, int flags,
> >                           int ncmds, struct epoll_mod_cmd *cmds,
> >                           struct epoll_wait_spec *spec);
> >
> > DESCRIPTION
> >
> >        The epoll_mod_wait() system call can be seen as an enhanced combination
> >        of several epoll_ctl(2) calls, which are followed by an epoll_pwait(2)
> >        call. It is superior in two cases:
> >
> >        1) When epoll_ctl(2) calls are followed by epoll_wait(2), using
> >        epoll_mod_wait will save context switches between user mode and
> >        kernel mode;
> >
> >        2) When you need finer than millisecond precision for the wait timeout.
> >
> >        The epoll_ctl(2) operations are embedded into this call via ncmds
> >        and cmds. The latter is an array of command structs:
> >
> >            struct epoll_mod_cmd {
> >
> >                   /* Reserved flags for future extension, must be 0 for now. */
> >                   int flags;
> >
> >                   /* The same as epoll_ctl() op parameter. */
> >                   int op;
> >
> >                   /* The same as epoll_ctl() fd parameter. */
> >                   int fd;
> >
> >                   /* The same as the "events" field in struct epoll_event. */
> >                   uint32_t events;
> >
> >                   /* The same as the "data" field in struct epoll_event. */
> >                   uint64_t data;
> >
> >                   /* Output field, will be set to the return code once this
> >                    * command is executed by kernel */
> >                   int error;
> >            };
> 
> I would add an extra u32 at the end so that the structure size will be
> a multiple of 8 bytes on all platforms.

OK, makes sense.
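
As a rough sketch of the padding suggestion (not taken from the patch): the
struct name and the "pad" field below are assumptions, and int is assumed to
be 32 bits wide. Without the trailing field, ABIs such as i386 that align
uint64_t to only 4 bytes inside structs would give a 28-byte struct, while
x86-64 would give 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical padded layout; "pad" is an assumed name, not from the patch. */
struct epoll_mod_cmd_padded {
        int      flags;   /* reserved, must be 0 for now */
        int      op;      /* same as the epoll_ctl() op parameter */
        int      fd;      /* same as the epoll_ctl() fd parameter */
        uint32_t events;  /* same as "events" in struct epoll_event */
        uint64_t data;    /* same as "data" in struct epoll_event */
        int      error;   /* output: return code of this command */
        uint32_t pad;     /* explicit tail padding, kept zero */
};

/*
 * 4 + 4 + 4 + 4 + 8 + 4 + 4 = 32 bytes. "data" starts at offset 16, so the
 * layout is the same whether uint64_t is aligned to 4 or to 8 bytes.
 */
static_assert(sizeof(struct epoll_mod_cmd_padded) == 32, "size must be 32");
static_assert(offsetof(struct epoll_mod_cmd_padded, data) == 16,
              "data must sit at offset 16");
```

With the explicit field, 32-bit and 64-bit callers would see the same array
stride, so no compat translation of the command array should be needed.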

> 
> >
> >        There is no guarantee that the commands are executed in order. Events
> >        are polled only if all the commands are executed successfully (all the
> >        error fields are set to 0).
> 
> If this doesn't happen, what error is returned?

The last error in executing commands.

> 
> >            struct epoll_wait_spec {
> >
> >                   /* The same as "maxevents" in epoll_pwait() */
> >                   int maxevents;
> >
> >                   /* The same as "events" in epoll_pwait() */
> >                   struct epoll_event *events;
> >
> >                   /* Which clock to use for timeout */
> >                   int clockid;
> >
> >                   /* Maximum time to wait if there is no event */
> >                   struct timespec timeout;
> >
> >                   /* The same as "sigmask" in epoll_pwait() */
> >                   sigset_t *sigmask;
> >
> >                   /* The same as "sigsetsize" in epoll_pwait() */
> >                   size_t sigsetsize;
> >            } EPOLL_PACKED;
> 
> I think the convention is to align the structure's fields manually
> rather than declaring it to be packed.

OK.

> 
> >
> > RETURN VALUE
> >
> >        When any error occurs, epoll_mod_wait() returns -1 and errno is set
> >        appropriately. All the "error" fields in cmds are unchanged before they
> >        are executed, and if any cmds are executed, the "error" fields are set
> >        to a return code accordingly. See also epoll_ctl for more details of the
> >        return code.
> 
> Does this mean that callers should initialize the error fields to an
> impossible value first so they can tell which commands were executed?

Yes.
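
A minimal sketch of what that would look like on the caller side, assuming
the struct layout from the draft above. The sentinel value, the helper names
and the negative sign of the error code are all assumptions; the kernel part
is only simulated here, since the syscall does not exist.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Layout copied from the draft man page quoted above. */
struct epoll_mod_cmd {
        int      flags;
        int      op;
        int      fd;
        uint32_t events;
        uint64_t data;
        int      error;
};

/* Assumed sentinel: no errno-style return code equals INT_MIN. */
#define CMD_NOT_EXECUTED INT_MIN

/* Call before the syscall so that untouched commands stay recognizable. */
static void cmds_mark_unexecuted(struct epoll_mod_cmd *cmds, size_t ncmds)
{
        for (size_t i = 0; i < ncmds; i++)
                cmds[i].error = CMD_NOT_EXECUTED;
}

/* Execution order is not guaranteed, so scan every entry, not a prefix. */
static size_t cmds_count_executed(const struct epoll_mod_cmd *cmds,
                                  size_t ncmds)
{
        size_t n = 0;

        for (size_t i = 0; i < ncmds; i++)
                if (cmds[i].error != CMD_NOT_EXECUTED)
                        n++;
        return n;
}

/* Simulates a kernel that ran only cmds[0] (ok) and cmds[2] (failed). */
static size_t sentinel_demo(void)
{
        struct epoll_mod_cmd cmds[3] = { {0}, {0}, {0} };

        cmds_mark_unexecuted(cmds, 3);
        cmds[0].error = 0;
        cmds[2].error = -EBADF;  /* stand-in; sign convention is assumed */
        return cmds_count_executed(cmds, 3);
}
```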

> 
> >
> >        When successful, epoll_mod_wait() returns the number of file
> >        descriptors ready for the requested I/O, or zero if no file descriptor
> >        became ready during the requested timeout.
> >
> >        If spec is NULL, it returns 0 if all the commands are successful, and -1
> >        if an error occurred.
> >
> > ERRORS
> >
> >        These errors apply on either the return value of epoll_mod_wait or error
> >        status for each command, respectively.
> 
> Please clarify which errors are returned overall and which are per-command.

OK.

Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-21  7:52         ` Michael Kerrisk (man-pages)
  (?)
@ 2015-01-21 10:34           ` Paolo Bonzini
  -1 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2015-01-21 10:34 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett



On 21/01/2015 08:52, Michael Kerrisk (man-pages) wrote:
>> > The problem is that there is no room for flags field in epoll_pwait1, which is
>> > asked for, in previous discussion thread [1].
> Ahh yes, I certainly should not have forgotten that. But that's easily solved.
> Do as for pselect6():
> 
> struct sigargs {
>     const sigset_t *ss;
>     size_t          ss_len; /* Size (in bytes) of object pointed
>                                to by 'ss' */
> };
> 
> epoll_pwait1(int epfd, struct epoll_event *events, int maxevents,
>              struct timespec *timeout, int clock_id, int flags,
>              const struct sigargs *sargs);
> 
> 

Alternatively, place the clock_id in the lower 16 bits of flags.
MAX_CLOCKS is 16 right now, so there's room.
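
For illustration, one way the packing could look. The macro and helper names
are made up for this sketch, and it assumes the clock id is a small
non-negative integer; clock ids outside 0..65535 cannot be represented this
way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: the low 16 bits carry the clock id, the rest flags. */
#define EPW_CLOCK_BITS 16
#define EPW_CLOCK_MASK ((1u << EPW_CLOCK_BITS) - 1)

/* Combine up to 16 flag bits and a clock id into one 32-bit argument. */
static inline uint32_t epw_pack(uint32_t flags, int clockid)
{
        return (flags << EPW_CLOCK_BITS) |
               ((uint32_t)clockid & EPW_CLOCK_MASK);
}

/* Recover the clock id from the low 16 bits. */
static inline int epw_clockid(uint32_t word)
{
        return (int)(word & EPW_CLOCK_MASK);
}

/* Recover the flag bits from the upper 16 bits. */
static inline uint32_t epw_flags(uint32_t word)
{
        return word >> EPW_CLOCK_BITS;
}
```

This would drop one argument from the proposed epoll_pwait1() signature at
the cost of halving the space left for future flags.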

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21 10:37             ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2015-01-21 10:37 UTC (permalink / raw)
  To: Fam Zheng, Michael Kerrisk (man-pages), Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	linux-fsdevel, linux-api, Josh Triplett



On 21/01/2015 09:58, Fam Zheng wrote:
>> > See my comment in the earlier mail. If you split this into two 
>> > APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
>> > then the return value of epoll_ctl_batch() could be used to tell
>> > user space how many commands succeeded. Much simpler!
> Yes it is much simpler. However the reason to add batching in the first place is
> to make epoll faster, by reducing syscalls. Splitting makes the result
> sub-optimal: we still need at least 2 calls instead of 1.  Each one of the three
> proposed new calls *is* a step forward, but I don't think we will have everything
> solved even by implementing them all. A compromise is needed between performance
> and complexity.
> 
> My take for simplicity will be leaving epoll_ctl as-is, and my take for
> performance will be epoll_pwait1. And I don't really like putting my time on
> epoll_ctl_batch, thinking of it as an ambivalent compromise in between.

I agree with Michael actually.  The big change is going from O(n)
epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
a fraction of a microsecond.
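
To make the proposed return-value semantics concrete, here is a user-space
stand-in. It is only a sketch: epoll_ctl_batch_emul() and struct batch_cmd
are hypothetical names, and the loop still issues one epoll_ctl() per
command, so it shows the in-order, stop-at-first-failure behaviour and not
the O(1) syscall count.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical command record; fields mirror the epoll_ctl() arguments. */
struct batch_cmd {
        int      op;
        int      fd;
        uint32_t events;
        uint64_t data;
};

/*
 * Run the commands in order, stop at the first failure and return how many
 * succeeded. errno is left as set by the failing epoll_ctl() call.
 */
static int epoll_ctl_batch_emul(int epfd, const struct batch_cmd *cmds,
                                int ncmds)
{
        int done = 0;

        for (; done < ncmds; done++) {
                struct epoll_event ev = { .events = cmds[done].events };

                ev.data.u64 = cmds[done].data;
                if (epoll_ctl(epfd, cmds[done].op, cmds[done].fd, &ev) < 0)
                        break;
        }
        return done;
}

/* Adds the same fd twice; the second ADD fails with EEXIST. */
static int batch_demo(void)
{
        int p[2];
        int epfd = epoll_create1(0);

        if (epfd < 0)
                return -1;
        if (pipe(p) < 0) {
                close(epfd);
                return -1;
        }

        struct batch_cmd cmds[] = {
                { EPOLL_CTL_ADD, p[0], EPOLLIN, 1 },
                { EPOLL_CTL_ADD, p[0], EPOLLIN, 2 },  /* fails: EEXIST */
                { EPOLL_CTL_DEL, p[0], 0, 0 },        /* never reached */
        };
        int n = epoll_ctl_batch_emul(epfd, cmds, 3);

        close(p[0]);
        close(p[1]);
        close(epfd);
        return n;
}
```

A caller that gets back a count smaller than ncmds knows that cmds[count] is
the command that failed and that nothing after it was attempted.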

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-21 10:37             ` Paolo Bonzini
  (?)
@ 2015-01-21 11:14               ` Fam Zheng
  -1 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-21 11:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael Kerrisk (man-pages),
	Andy Lutomirski, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett

On Wed, 01/21 11:37, Paolo Bonzini wrote:
> 
> 
> On 21/01/2015 09:58, Fam Zheng wrote:
> >> > See my comment in the earlier mail. If you split this into two 
> >> > APIs, and epoll_ctl_batch() is guaranteed to execute 'cmds' in order, 
> >> > then the return value of epoll_ctl_batch() could be used to tell
> >> > user space how many commands succeeded. Much simpler!
> > Yes it is much simpler. However the reason to add batching in the first place is
> > to make epoll faster, by reducing syscalls. Splitting makes the result
> > sub-optimal: we still need at least 2 calls instead of 1.  Each one of the three
> > proposed new calls *is* a step forward, but I don't think we will have everything
> > solved even by implementing them all. A compromise is needed between performance
> > and complexity.
> > 
> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
> > performance will be epoll_pwait1. And I don't really like putting my time on
> > epoll_ctl_batch, thinking of it as an ambivalent compromise in between.
> 
> I agree with Michael actually.  The big change is going from O(n)
> epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
> Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
> a fraction of a microsecond.
> 

Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
doesn't change that radically from one iteration to another, does it?

Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-21 11:50                 ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2015-01-21 11:50 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Michael Kerrisk (man-pages),
	Andy Lutomirski, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett



On 21/01/2015 12:14, Fam Zheng wrote:
> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
> > performance will be epoll_pwait1. And I don't really like putting my time on
> > epoll_ctl_batch, thinking of it as an ambivalent compromise in between.
> 
> > I agree with Michael actually.  The big change is going from O(n)
> > epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
> > Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
> > a fraction of a microsecond.
> 
> Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
> doesn't change that radically from one iteration to another, does it?

That depends on the application.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
  2015-01-21 11:50                 ` Paolo Bonzini
  (?)
@ 2015-01-22 21:12                   ` Andy Lutomirski
  -1 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2015-01-22 21:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Fam Zheng, Michael Kerrisk (man-pages),
	linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett

On Wed, Jan 21, 2015 at 3:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 21/01/2015 12:14, Fam Zheng wrote:
>> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
>> > performance will be epoll_pwait1. And I don't really like putting my time on
>> > epoll_ctl_batch, thinking of it as an ambivalent compromise in between.
>>
>> > I agree with Michael actually.  The big change is going from O(n)
>> > epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
>> > Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
>> > a fraction of a microsecond.
>>
>> Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
>> doesn't change that radically from one iteration to another, does it?
>
> That depends on the application.

In my application, the set of fds almost never changes, but the set of
events I want changes all the time.  The main thing that changes is
whether I care about EPOLLOUT.  If I'm ready to send something, then I
want EPOLLOUT.  If I'm not ready, then I don't want EPOLLOUT.
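
With today's API, that pattern is one EPOLL_CTL_MOD per toggle. A small
sketch follows; the helper names are made up, and it uses a socketpair only
so that the example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Keep EPOLLIN armed; request EPOLLOUT only while output is pending. */
static int set_want_write(int epfd, int fd, bool want_write)
{
        struct epoll_event ev = {
                .events = EPOLLIN | (want_write ? EPOLLOUT : 0),
        };

        ev.data.fd = fd;
        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/*
 * Registers one end of a fresh socketpair (writable, nothing to read),
 * applies the toggle and returns what a non-blocking epoll_wait() reports.
 */
static int ready_count(bool want_write)
{
        int sv[2], n = -1;
        struct epoll_event ev = { .events = EPOLLIN }, out;
        int epfd = epoll_create1(0);

        if (epfd < 0)
                return -1;
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
                close(epfd);
                return -1;
        }
        ev.data.fd = sv[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == 0 &&
            set_want_write(epfd, sv[0], want_write) == 0)
                n = epoll_wait(epfd, &out, 1, 0);

        close(sv[0]);
        close(sv[1]);
        close(epfd);
        return n;
}
```

An application that flips this bit on many fds per loop iteration pays one
syscall per flip, which is the cost a batched interface would remove.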

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-23  6:20                     ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-23  6:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Michael Kerrisk (man-pages),
	linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett

On Thu, 01/22 13:12, Andy Lutomirski wrote:
> On Wed, Jan 21, 2015 at 3:50 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> > On 21/01/2015 12:14, Fam Zheng wrote:
> >> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
> >> > performance will be epoll_pwait1. And I don't really like putting my time on
> >> > epoll_ctl_batch, thinking of it as an ambivalent compromise in between.
> >>
> >> > I agree with Michael actually.  The big change is going from O(n)
> >> > epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
> >> > Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
> >> > a fraction of a microsecond.
> >>
> >> Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
> >> doesn't change that radically from one iteration to another, does it?
> >
> > That depends on the application.
> 
> In my application, the set of fds almost never changes, but the set of
> events I want changes all the time.  The main thing that changes is
> whether I care about EPOLLOUT.  If I'm ready to send something, then I
> want EPOLLOUT.  If I'm not ready, then I don't want EPOLLOUT.
> 

OK, I'll split it into epoll_ctl_batch and epoll_pwait1 in v2, as Michael
suggested.

Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-23  6:20                     ` Fam Zheng
  0 siblings, 0 replies; 80+ messages in thread
From: Fam Zheng @ 2015-01-23  6:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Michael Kerrisk (man-pages),
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86 ML, Alexander Viro,
	Andrew Morton, Kees Cook, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh

On Thu, 01/22 13:12, Andy Lutomirski wrote:
> On Wed, Jan 21, 2015 at 3:50 AM, Paolo Bonzini <pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> >
> > On 21/01/2015 12:14, Fam Zheng wrote:
> >> > My take for simplicity will be leaving epoll_ctl as-is, and my take for
> >> > performance will be epoll_pwait1. And I don't really like putting my time on
> >> > epoll_ctl_batch, thinking it as a ambivalent compromise in between.
> >>
> >> > I agree with Michael actually.  The big change is going from O(n)
> >> > epoll_ctl calls to O(1), and epoll_ctl_batch achieves that just fine.
> >> > Changing 2 syscalls to 1 is the icing on the cake, but we're talking of
> >> > a fraction of a microsecond.
> >>
> >> Maybe I'm missing something, but in common cases, the set of fds for epoll_wait
> >> doesn't change that radically from one iteration to another, does it?
> >
> > That depends on the application.
> 
> In my application, the set of fds almost never changes, but the set of
> events I want changes all the time.  The main thing that changes is
> whether I care about EPOLLOUT.  If I'm ready to send something, then I
> want EPOLLOUT.  If I'm not ready, then I don't want EPOLLOUT.
> 

OK, I'll split it to epoll_ctl_batch and epoll_pwait1 as Micheal suggested in
v2.

Fam

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
@ 2015-01-23  9:56                     ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2015-01-23  9:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Fam Zheng, Michael Kerrisk (man-pages),
	linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett



On 22/01/2015 22:12, Andy Lutomirski wrote:
> In my application, the set of fds almost never changes, but the set of
> events I want changes all the time.  The main thing that changes is
> whether I care about EPOLLOUT.  If I'm ready to send something, then I
> want EPOLLOUT.  If I'm not ready, then I don't want EPOLLOUT.

Yes, this is almost always the case unless you use EPOLLET.

Paolo
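
As an illustration of the usage Andy describes (a fixed set of fds whose
interest mask flips between wanting and not wanting EPOLLOUT), here is a
minimal sketch using level-triggered epoll. struct conn, conn_register and
conn_want_write are hypothetical names introduced for this example; only
epoll_ctl(2) and the EPOLL* constants are real interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical per-connection state for this sketch. */
struct conn {
    int fd;
    uint32_t events;   /* interest mask currently registered with epoll */
};

/* Register the connection for reads only. Returns 0 or -1. */
int conn_register(int epfd, struct conn *c)
{
    assert(c);
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.ptr = c;
    c->events = EPOLLIN;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Turn EPOLLOUT interest on or off.
 * Returns 1 if an epoll_ctl(2) call was made, 0 if the registered mask was
 * already correct (no syscall), and -1 on error. */
int conn_want_write(int epfd, struct conn *c, bool want)
{
    assert(c);
    uint32_t events = want ? (c->events | EPOLLOUT)
                           : (c->events & ~(uint32_t)EPOLLOUT);
    if (events == c->events)
        return 0;                         /* nothing to change */

    struct epoll_event ev = { .events = events };
    ev.data.ptr = c;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev) < 0)
        return -1;

    c->events = events;
    return 1;
}
```

Each real change of the mask costs one EPOLL_CTL_MOD call, so an event loop
with many such connections issues a burst of epoll_ctl(2) calls before each
wait. With EPOLLET, a common alternative is to register
EPOLLIN | EPOLLOUT | EPOLLET once and leave it alone, which is presumably the
exception mentioned above.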

Thread overview:
2015-01-20  9:57 [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait" Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 1/6] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 2/6] epoll: Specify clockid explicitly Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 3/6] epoll: Add definition for epoll_mod_wait structures Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 4/6] epoll: Extract ep_ctl_do Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait Fam Zheng
2015-01-20 12:50   ` Michael Kerrisk (man-pages)
2015-01-21  4:59     ` Fam Zheng
2015-01-21  7:52       ` Michael Kerrisk (man-pages)
2015-01-21  8:58         ` Fam Zheng
2015-01-21 10:37           ` Paolo Bonzini
2015-01-21 11:14             ` Fam Zheng
2015-01-21 11:50               ` Paolo Bonzini
2015-01-22 21:12                 ` Andy Lutomirski
2015-01-23  6:20                   ` Fam Zheng
2015-01-23  9:56                   ` Paolo Bonzini
2015-01-21 10:34         ` Paolo Bonzini
2015-01-21  7:56   ` Omar Sandoval
2015-01-21  8:59     ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 6/6] x86: Hook up epoll_mod_wait syscall Fam Zheng
2015-01-20 10:37 ` [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait" Rasmus Villemoes
2015-01-20 10:53   ` Fam Zheng
2015-01-20 12:48 ` Michael Kerrisk (man-pages)
2015-01-21  9:05   ` Fam Zheng
2015-01-20 22:40 ` Andy Lutomirski
2015-01-20 23:03   ` josh
2015-01-21  5:55   ` Michael Kerrisk (man-pages)
2015-01-21  9:07   ` Fam Zheng
