linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
@ 2015-02-13  9:03 Fam Zheng
  2015-02-13  9:03 ` [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Hi all,

This is the updated series for the new epoll system calls, with the cover
letter rewritten which includes some more explanation. Comments are very
welcome!

Original Motivation
===================

QEMU, and probably many more select/poll based applications, will consider
epoll as an alternative, when its event loop needs to handle a big number of
fds. However, there are currently two concerns with epoll which prevents the
switching from ppoll to epoll. 

The major one is the timeout precision.

For example in QEMU, the main loop takes care of calling callbacks at a
specific timeout - the QEMU timer API. The timeout value in ppoll depends on
the next firing timer. epoll_pwait's millisecond timeout is so coarse that
rounding up the timeout will hurt performance badly.

The minor one is the number of system call to update fd set.

While epoll can handle a large number of fds quickly, it still requires one
epoll_ctl per fd update, compared to the one-shot call to select/poll with an
fd array. This may as well make epoll inferior to ppoll in the cases where a
small, but frequently changing set of fds are polled by the event loop.

This series introduces two new epoll APIs to address them respectively. The
idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
suggested clockid as a parameter in epoll_pwait1.

Discussion
==========

[Note: This is the question part regarding the interface contract of
epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch yet,
please skip this part and probably start with the man page style documentation.
You can resume to this section later.]

[Thanks to Omar Sandoval <osandov@osandov.com>, who pointed out this in
reviewing v1]

We try to report status for each command in epoll_ctl_batch, by writting to
user provided command array (pointed to cmds). The tricky thing in the
implementation is that, copying the results back to userspace comes last, after
the commands are executed. At this point, if the copy_to_user fails, the
effects are done and no return - or if we add some mechanism to revert it, the
code will be too complicated and slow.

In above corner case, the return value of epoll_ctl_batch is smaller than
ncmds, which assures our caller that last N commands failed, where N = ncmds -
ret.  But they'll also find that cmd.result is not changed to error code.

I suppose this is not a big deal, because 1) it should happen very rarely. 2)
user does know the actual number of commands that succeed.

So, do we leave it as is? Or is there any way to improve?

One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
before we execute the commands. If it succeeds, it's even less likely the last
copy_to_user could fail, so that we can even probably assert it won't. The
testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
performance impact, especially when @ncmds is big.

Links
=====

[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542

Changelog
=========

Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------

  - Rename epoll_ctl_cmd.error_hint to "result". [Michael]

  - Add background introduction in cover letter. [Michael]

  - Expand the last struct of epoll_pwait1, add clockid and timespec.
  
  - Update man page in cover letter accordingly:

    * "error_hint" -> "result".
    * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.

  Please review!

Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------

  - As discussed in previous thread [1], split the call to epoll_ctl_batch and
    epoll_pwait. [Michael]

  - Fix memory leaks. [Omar]

  - Add a short comment about the ignored copy_to_user failure. [Omar]

  - Cover letter rewritten.

Documentation of the new system calls
=====================================

[I tried to write in the familiar man page style, but I am not proficient on
this. Thanks for Michael's help and suggestions in helping me improve it
through previous discussions.]

1) epoll_ctl_batch
------------------

NAME
       epoll_ctl_batch - modify an epoll descriptor in batch

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_ctl_batch(int epfd, int flags,
                           int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

       The system call performs a batch of epoll_ctl operations. It allows
       efficient update of events on this epoll descriptor.

       Flags is reserved and must be 0.

       Each operation is specified as an element in the cmds array, defined as:

           struct epoll_ctl_cmd {

                  /* Reserved flags for future extension, must be 0. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code after this
                   * command is executed by kernel */
                  int result;
           };

       Commands are executed in their order in cmds, and if one of them failed,
       the rest after it are not tried.

       Not that this call isn't atomic in terms of updating the epoll
       descriptor, which means a second epoll_ctl or epoll_ctl_batch call
       during the first epoll_ctl_batch may make the operation sequence
       interleaved. However, each single epoll_ctl_cmd operation has the same
       semantics as a epoll_ctl call.

RETURN VALUE

       If one or more of the parameters are incorrect, -1 is returned and errno
       is set appropriately. Otherwise, the number of succeeded commands is
       returned.

       Each 'result' field may be set to the error code or 0, depending on the
       result of the command. If the kernel fails to set the field after the
       commands are completed or failed, it may also be unchanged, even though
       the effects of successful commands are done. In this case, it's still
       ensured that 1) the i-th (i = ret) command is the failed command; 2) all
       the preceeding commands are successfully executed; 3) all the subsequent
       ones are not executed.

ERRORS

       Errors for the overall system call (in errno) are:

       EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
              cmds is NULL.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       EFAULT The memory area pointed to by cmds is not accessible with write
              permissions.


       Errors for each command (in result field) are:

       EBADF  epfd or fd is not a valid file descriptor.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
              already registered with this epoll instance.

       EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
              or the requested operation op is not supported by this interface.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
              with this epoll instance.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
              encountered while trying to register (EPOLL_CTL_ADD) a new file
              descriptor on an epoll instance.  See epoll(7) for further
              details.

       EPERM  The target file fd does not support epoll.

CONFORMING TO

       epoll_ctl_batch() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)


2) epoll_pwait1
---------------

NAME
       epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_pwait1(int epfd, int flags,
                        struct epoll_event *events,
                        int maxevents,
                        struct epoll_wait_params *params);

DESCRIPTION

       The epoll_pwait1 system call differs from epoll_pwait only in
       parameter list. The epfd, events and maxevents parameters are the same
       as in epoll_wait and epoll_pwait. The flags and params are new.

       The flags is reserved and must be zero.

       The params is a pointer to a structure parameters for the epoll_pwait1,
       defined as:

           struct epoll_wait_params {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
               size_t sigsetsize;
           };

       The field clockid must be either CLOCK_REALTIME or CLOCK_MONOTONIC, to
       choose the clock type to use for timeout. Note that CLOCK_MONOTONIC is
       the implicit and only possible clock type for epoll_pwait.
       
       The timeout field specifies the minimum time that epoll_wait() will
       block.  (The effective length will be rounded up to the clock
       granularity, and kernel scheduling delays mean that the blocking
       interval may overrun by a small amount.) Specifying a nagative time
       lenght (for example, timeout.tv_sec = 0 and timeout.tv_nsec = -1, or the
       other way round) causes epoll_pwait1() to block indefinitely, while
       specifying a timeout equal to zero (both fields in timeout are zero)
       causes epoll_wait() to return immediately, even if no events are
       available.

       The sigmask and sigsetsize has the same semantics as epoll_pwait. The
       sigmask field may be specified as NULL, in which case epoll_pwait1()
       will behave like epoll_wait.

   User visibility of sigsetsize

       In epoll_pwait and other syscalls, sigsetsize is not visible to
       application developer as glibc has a wrapper around epoll_pwait system
       call. Now that we pack several parameters in epoll_wait_params, so in
       order to hide sigsetsize from application code, we can still wrap it,
       either by expanding parameters and build the structure in the wrapper
       function, or only ask application to provide the first half:

           struct epoll_wait_params_user {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
           };

      Then in the wrapper function copy to a full structure and fill in
      sigsetsize.

RETURN VALUE

       The same as said in epoll_pwait(2).

ERRORS

       The same as said in man epoll_pwait(2), plus:

       EINVAL flags is not zero, or clockid is neither CLOCK_REALTIME nor
              CLOCK_MONOTONIC.

       EFAULT The memory area pointed to by params is not accessible.


CONFORMING TO

       epoll_pwait1() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Fam Zheng (7):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Specify clockid explicitly
  epoll: Extract ep_ctl_do
  epoll: Add implementation for epoll_ctl_batch
  x86: Hook up epoll_ctl_batch syscall
  epoll: Add implementation for epoll_pwait1
  x86: Hook up epoll_pwait1 syscall

 arch/x86/syscalls/syscall_32.tbl |   2 +
 arch/x86/syscalls/syscall_64.tbl |   2 +
 fs/eventpoll.c                   | 258 +++++++++++++++++++++++++--------------
 include/linux/syscalls.h         |   9 ++
 include/uapi/linux/eventpoll.h   |  18 +++
 5 files changed, 199 insertions(+), 90 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-02-13  9:03 ` Fam Zheng
  2015-02-13  9:03 ` [PATCH RFC v3 2/7] epoll: Specify clockid explicitly Fam Zheng
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

In preparation of new epoll syscalls, this patch allows reusing the code from
epoll_pwait implementation. The new functions uses ktime_t for more accuracy.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 162 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 75 insertions(+), 87 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
 }
 
-static inline struct timespec ep_set_mstimeout(long ms)
-{
-	struct timespec now, ts = {
-		.tv_sec = ms / MSEC_PER_SEC,
-		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-	};
-
-	ktime_get_ts(&now);
-	return timespec_add_safe(now, ts);
-}
-
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  *           event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
  * @events: Pointer to the userspace buffer where the ready events should be
  *          stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- *           milliseconds. If the @timeout is zero, the function will not block,
- *           while if the @timeout is less than zero, the function will block
- *           until at least one event has been retrieved (or an error
- *           occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation.  If 0, the
+ *           function will not block. If negative, the function will block until
+ *           at least one event has been retrieved (or an error occurred).
  *
  * Returns: Returns the number of ready events which have been fetched, or an
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
-	if (timeout > 0) {
-		struct timespec end_time = ep_set_mstimeout(timeout);
-
-		slack = select_estimate_accuracy(&end_time);
-		to = &expires;
-		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
+	if (!ktime_to_ns(timeout)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
 		 * caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 		spin_lock_irqsave(&ep->lock, flags);
 		goto check_events;
+	} else if (ktime_to_ns(timeout) > 0) {
+		struct timespec now, end_time;
+
+		ktime_get_ts(&now);
+		end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+		slack = select_estimate_accuracy(&end_time);
+		to = &expires;
+		*to = timespec_to_ktime(end_time);
 	}
 
 fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
 	return error;
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+				int maxevents, const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -2002,91 +1988,93 @@ error_fput:
 
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_wait(2).
+ */
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+				 int maxevents, ktime_t timeout,
+				 sigset_t *sigmask, size_t sigsetsize)
+{
+	int error;
+	sigset_t sigsaved;
+
+	/*
+	 * If the caller wants a certain signal mask to be set during the wait,
+	 * we apply it here.
+	 */
+	if (sigmask) {
+		sigsaved = current->blocked;
+		set_current_blocked(sigmask);
+	}
+
+	error = epoll_wait_do(epfd, events, maxevents, timeout);
+
+	/*
+	 * If we changed the signal mask, we need to restore the original one.
+	 * In case we've got a signal while waiting, we do not restore the
+	 * signal mask yet, and we allow do_signal() to deliver the signal on
+	 * the way back to userspace, before the signal mask is restored.
+	 */
+	if (sigmask) {
+		if (error == -EINTR) {
+			memcpy(&current->saved_sigmask, &sigsaved,
+			       sizeof(sigsaved));
+			set_restore_sigmask();
+		} else
+			set_current_blocked(&sigsaved);
+	}
+
+	return error;
+}
+
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
  * part of the user space epoll_pwait(2).
  */
 SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
 		size_t, sigsetsize)
 {
-	int error;
-	sigset_t ksigmask, sigsaved;
+	ktime_t kt = ms_to_ktime(timeout);
+	sigset_t ksigmask;
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
 			return -EFAULT;
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
 	}
-
-	error = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (error == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
-	}
-
-	return error;
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-			struct epoll_event __user *, events,
-			int, maxevents, int, timeout,
-			const compat_sigset_t __user *, sigmask,
-			compat_size_t, sigsetsize)
+		       struct epoll_event __user *, events,
+		       int, maxevents, int, timeout,
+		       const compat_sigset_t __user *, sigmask,
+		       compat_size_t, sigsetsize)
 {
-	long err;
 	compat_sigset_t csigmask;
-	sigset_t ksigmask, sigsaved;
+	sigset_t ksigmask;
+	ktime_t kt = ms_to_ktime(timeout);
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(compat_sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
 			return -EFAULT;
 		sigset_from_compat(&ksigmask, &csigmask);
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
 	}
 
-	err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (err == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
-	}
-
-	return err;
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 #endif
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 2/7] epoll: Specify clockid explicitly
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
  2015-02-13  9:03 ` [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
@ 2015-02-13  9:03 ` Fam Zheng
  2015-02-13  9:03 ` [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do Fam Zheng
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Later we will add clockid in the interface, so let's start using explicit
clockid internally. Now we specify CLOCK_MONOTONIC, which is the same as before.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..5840b03 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, const ktime_t timeout)
+		   int maxevents, int clockid, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1578,6 +1578,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
+	if (clockid != CLOCK_MONOTONIC && clockid != CLOCK_REALTIME)
+		return -EINVAL;
 	if (!ktime_to_ns(timeout)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
@@ -1624,7 +1626,8 @@ fetch_events:
 			}
 
 			spin_unlock_irqrestore(&ep->lock, flags);
-			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+			if (!schedule_hrtimeout_range_clock(to, slack,
+						HRTIMER_MODE_ABS, clockid))
 				timed_out = 1;
 
 			spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1948,8 @@ error_return:
 }
 
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
-				int maxevents, const ktime_t timeout)
+				int maxevents, int clockid,
+				const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -1979,7 +1983,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 	ep = f.file->private_data;
 
 	/* Time to fish for events ... */
-	error = ep_poll(ep, events, maxevents, timeout);
+	error = ep_poll(ep, events, maxevents, clockid, timeout);
 
 error_fput:
 	fdput(f);
@@ -1994,12 +1998,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 		int, maxevents, int, timeout)
 {
 	ktime_t kt = ms_to_ktime(timeout);
-	return epoll_wait_do(epfd, events, maxevents, kt);
+	return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
 }
 
 static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
-				 int maxevents, ktime_t timeout,
-				 sigset_t *sigmask, size_t sigsetsize)
+				 int maxevents,
+				 int clockid, ktime_t timeout,
+				 sigset_t *sigmask)
 {
 	int error;
 	sigset_t sigsaved;
@@ -2013,7 +2018,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
 		set_current_blocked(sigmask);
 	}
 
-	error = epoll_wait_do(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2055,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
 			return -EFAULT;
 	}
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 
 #ifdef CONFIG_COMPAT
@@ -2073,8 +2078,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		sigset_from_compat(&ksigmask, &csigmask);
 	}
 
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 #endif
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
  2015-02-13  9:03 ` [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
  2015-02-13  9:03 ` [PATCH RFC v3 2/7] epoll: Specify clockid explicitly Fam Zheng
@ 2015-02-13  9:03 ` Fam Zheng
  2015-02-13  9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_ctl_wait.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 5840b03..6a2b0a4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1810,22 +1810,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
 {
 	int error;
 	int full_check = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
 	struct eventpoll *tep = NULL;
 
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
 	error = -EBADF;
 	f = fdget(epfd);
 	if (!f.file)
@@ -1947,6 +1940,23 @@ error_return:
 	return error;
 }
 
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	struct epoll_event epds;
+
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	return ep_ctl_do(epfd, op, fd, epds);
+}
+
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 				int maxevents, int clockid,
 				const ktime_t timeout)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (2 preceding siblings ...)
  2015-02-13  9:03 ` [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do Fam Zheng
@ 2015-02-13  9:04 ` Fam Zheng
  2015-02-13  9:04 ` [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h       |  4 ++++
 include/uapi/linux/eventpoll.h | 11 ++++++++++
 3 files changed, 63 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 6a2b0a4..12e2e63 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2069,6 +2069,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL);
 }
 
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+		int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+	struct epoll_ctl_cmd *kcmds = NULL;
+	int i, r, ret = 0;
+	int cmd_size;
+
+	if (flags)
+		return -EINVAL;
+	if (ncmds <= 0 || !cmds)
+		return -EINVAL;
+	cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+	kcmds = kmalloc(cmd_size, GFP_KERNEL);
+	if (!kcmds)
+		return -ENOMEM;
+	if (copy_from_user(kcmds, cmds, cmd_size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	for (i = 0; i < ncmds; i++) {
+		struct epoll_event ev = (struct epoll_event) {
+			.events = kcmds[i].events,
+			.data = kcmds[i].data,
+		};
+		if (kcmds[i].flags) {
+			kcmds[i].result = -EINVAL;
+			goto copy;
+		}
+		kcmds[i].result = ep_ctl_do(epfd, kcmds[i].op,
+					    kcmds[i].fd, ev);
+		if (kcmds[i].result)
+			goto copy;
+		ret++;
+	}
+copy:
+	r = copy_to_user(cmds, kcmds,
+			 sizeof(struct epoll_ctl_cmd) * ncmds);
+	/* Failing to copy the command results back will leave
+	 * userspace no way to know the actual error code, but we still
+	 * report the number of succeeded commands with ret, so it's
+	 * not a big problem. Ignore it for now.
+	 */
+	(void) r;
+out:
+	kfree(kcmds);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..7d784e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
 #define _LINUX_SYSCALLS_H
 
 struct epoll_event;
+struct epoll_ctl_cmd;
 struct iattr;
 struct inode;
 struct iocb;
@@ -634,6 +635,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+				    int ncmds,
+				    struct epoll_ctl_cmd __user *cmds);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..4e18b17 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
 #include <linux/fcntl.h>
 #include <linux/types.h>
 
+#include <linux/signal.h>
+
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
@@ -61,6 +63,15 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+struct epoll_ctl_cmd {
+	int flags;
+	int op;
+	int fd;
+	__u32 events;
+	__u64 data;
+	int result;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (3 preceding siblings ...)
  2015-02-13  9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
@ 2015-02-13  9:04 ` Fam Zheng
  2015-02-13  9:04 ` [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..fe809f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	epoll_ctl_batch		sys_epoll_ctl_batch
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..67b2ea4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	epoll_ctl_batch		sys_epoll_ctl_batch
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (4 preceding siblings ...)
  2015-02-13  9:04 ` [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
@ 2015-02-13  9:04 ` Fam Zheng
  2015-02-13  9:04 ` [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This is the new implementation for poll which has a flags parameter and
packs a number of parameters into a structure.

The main advantage of it over existing epoll_pwait is about timeout:
epoll_pwait expects a relative millisecond value, while epoll_pwait1
accepts 1) a timespec which is in nanosecond granularity; 2) a clockid
to allow using a clock other than CLOCK_MONOTONIC.

The 'flags' field in params is reserved for now and must be zero. The
next step would be allowing absolute timeout value.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c                 | 27 +++++++++++++++++++++++++++
 include/linux/syscalls.h       |  5 +++++
 include/uapi/linux/eventpoll.h |  7 +++++++
 3 files changed, 39 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 12e2e63..cc51e8c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2117,6 +2117,33 @@ out:
 	return ret;
 }
 
+SYSCALL_DEFINE5(epoll_pwait1, int, epfd, int, flags,
+		struct epoll_event __user *, events,
+		int, maxevents,
+		struct epoll_wait_params __user *, params)
+{
+	struct epoll_wait_params p;
+	ktime_t kt = { 0 };
+	sigset_t sigmask;
+
+	if (flags)
+		return -EINVAL;
+	if (!params)
+		return -EINVAL;
+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
+	if (p.sigmask) {
+		if (copy_from_user(&sigmask, p.sigmask, sizeof(sigmask)))
+			return -EFAULT;
+		if (p.sigsetsize != sizeof(p.sigmask))
+			return -EINVAL;
+	}
+	kt = timespec_to_ktime(p.timeout);
+
+	return epoll_pwait_do(epfd, events, maxevents, p.clockid,
+			      kt, p.sigmask ? &sigmask : NULL);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7d784e3..a4823d9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -13,6 +13,7 @@
 
 struct epoll_event;
 struct epoll_ctl_cmd;
+struct epoll_wait_params;
 struct iattr;
 struct inode;
 struct iocb;
@@ -635,6 +636,10 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, int flags,
+				 struct epoll_event __user *events,
+				 int maxevents,
+				 struct epoll_wait_params __user *params);
 asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
 				    int ncmds,
 				    struct epoll_ctl_cmd __user *cmds);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 4e18b17..d35c591 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -72,6 +72,13 @@ struct epoll_ctl_cmd {
 	int result;
 } EPOLL_PACKED;
 
+struct epoll_wait_params {
+	int clockid;
+	struct timespec timeout;
+	sigset_t *sigmask;
+	size_t sigsetsize;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (5 preceding siblings ...)
  2015-02-13  9:04 ` [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
@ 2015-02-13  9:04 ` Fam Zheng
  2015-02-13  9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
  2015-02-15 22:00 ` Jonathan Corbet
  8 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-13  9:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index fe809f6..bf912d8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,3 +366,4 @@
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
 359	i386	epoll_ctl_batch		sys_epoll_ctl_batch
+360	i386	epoll_pwait1		sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 67b2ea4..9246ad5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
 323	64	epoll_ctl_batch		sys_epoll_ctl_batch
+324	64	epoll_pwait1		sys_epoll_pwait1
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (6 preceding siblings ...)
  2015-02-13  9:04 ` [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
@ 2015-02-13  9:53 ` Omar Sandoval
  2015-02-15  6:44   ` Fam Zheng
  2015-02-15 15:16   ` Michael Kerrisk (man-pages)
  2015-02-15 22:00 ` Jonathan Corbet
  8 siblings, 2 replies; 17+ messages in thread
From: Omar Sandoval @ 2015-02-13  9:53 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini

On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
> Hi all,
> 
> This is the updated series for the new epoll system calls, with the cover
> letter rewritten which includes some more explanation. Comments are very
> welcome!
> 
> Original Motivation
> ===================
> 
> QEMU, and probably many more select/poll based applications, will consider
> epoll as an alternative, when its event loop needs to handle a big number of
> fds. However, there are currently two concerns with epoll which prevents the
> switching from ppoll to epoll. 
> 
> The major one is the timeout precision.
> 
> For example in QEMU, the main loop takes care of calling callbacks at a
> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
> rounding up the timeout will hurt performance badly.
> 
> The minor one is the number of system call to update fd set.
> 
> While epoll can handle a large number of fds quickly, it still requires one
> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
> fd array. This may as well make epoll inferior to ppoll in the cases where a
> small, but frequently changing set of fds are polled by the event loop.
> 
> This series introduces two new epoll APIs to address them respectively. The
> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
> suggested clockid as a parameter in epoll_pwait1.
> 
> Discussion
> ==========
> 
> [Note: This is the question part regarding the interface contract of
> epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch yet,
> please skip this part and probably start with the man page style documentation.
> You can resume to this section later.]
> 
> [Thanks to Omar Sandoval <osandov@osandov.com>, who pointed out this in
> reviewing v1]
> 
> We try to report status for each command in epoll_ctl_batch, by writting to
> user provided command array (pointed to cmds). The tricky thing in the
> implementation is that, copying the results back to userspace comes last, after
> the commands are executed. At this point, if the copy_to_user fails, the
> effects are done and no return - or if we add some mechanism to revert it, the
> code will be too complicated and slow.
> 
> In above corner case, the return value of epoll_ctl_batch is smaller than
> ncmds, which assures our caller that last N commands failed, where N = ncmds -
> ret.  But they'll also find that cmd.result is not changed to error code.
> 
> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
> user does know the actual number of commands that succeed.
> 
> So, do we leave it as is? Or is there any way to improve?
> 
> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
> before we execute the commands. If it succeeds, it's even less likely the last
> copy_to_user could fail, so that we can even probably assert it won't. The
> testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
> performance impact, especially when @ncmds is big.
> 
I don't think this is the right thing to do, since, for example, another
thread could unmap the memory region holding buffer between the "check"
copy_to_user and the actual one.

The two alternatives that I see are:

1. If copy_to_user fails, ignore it and return the number of commands
that succeeded (i.e., what the code in your patch does).
2. If copy_to_user fails, return -EFAULT, regardless of how many
commands succeeded.

The problem with 1 is that it could potentially mask bugs in a user
program. You could imagine a buggy program that passes a read-only
buffer to epoll_ctl_batch and never finds out about it because they
don't get the error. (Then, when there's a real error in one of the
epoll_ctl_cmds, but .result is 0, someone will be very confused.)

So I think 2 is the better option. Sure, the user will have no idea how
many commands were executed, but when would EFAULT on this system call
be part of normal operation, anyways? You'll find the memory bug, fix
it, and rest easy knowing that nothing is silently failing behind your
back.

> Links
> =====
> 
> [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
> 
> Changelog
> =========
> 
> Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
> ----------------------------------------------------
> 
>   - Rename epoll_ctl_cmd.error_hint to "result". [Michael]
> 
>   - Add background introduction in cover letter. [Michael]
> 
>   - Expand the last struct of epoll_pwait1, add clockid and timespec.
>   
>   - Update man page in cover letter accordingly:
> 
>     * "error_hint" -> "result".
>     * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.
> 
>   Please review!
> 
> Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
> -----------------------------------------------------
> 
>   - As discussed in previous thread [1], split the call to epoll_ctl_batch and
>     epoll_pwait. [Michael]
> 
>   - Fix memory leaks. [Omar]
> 
>   - Add a short comment about the ignored copy_to_user failure. [Omar]
> 
>   - Cover letter rewritten.
> 
[snip]

-- 
Omar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-13  9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
@ 2015-02-15  6:44   ` Fam Zheng
  2015-02-15 15:16   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-15  6:44 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-kernel, famz

On Fri, 02/13 01:53, Omar Sandoval wrote:
<snip>
> > Discussion
> > ==========
> > 
> > [Note: This is the question part regarding the interface contract of
> > epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch yet,
> > please skip this part and probably start with the man page style documentation.
> > You can resume to this section later.]
> > 
> > [Thanks to Omar Sandoval <osandov@osandov.com>, who pointed out this in
> > reviewing v1]
> > 
> > We try to report status for each command in epoll_ctl_batch, by writting to
> > user provided command array (pointed to cmds). The tricky thing in the
> > implementation is that, copying the results back to userspace comes last, after
> > the commands are executed. At this point, if the copy_to_user fails, the
> > effects are done and no return - or if we add some mechanism to revert it, the
> > code will be too complicated and slow.
> > 
> > In above corner case, the return value of epoll_ctl_batch is smaller than
> > ncmds, which assures our caller that last N commands failed, where N = ncmds -
> > ret.  But they'll also find that cmd.result is not changed to error code.
> > 
> > I suppose this is not a big deal, because 1) it should happen very rarely. 2)
> > user does know the actual number of commands that succeed.
> > 
> > So, do we leave it as is? Or is there any way to improve?
> > 
> > One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
> > before we execute the commands. If it succeeds, it's even less likely the last
> > copy_to_user could fail, so that we can even probably assert it won't. The
> > testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
> > right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
> > performance impact, especially when @ncmds is big.
> > 
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
> 
> The two alternatives that I see are:
> 
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
> 
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
> 
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.
> 

OK, I agree with you here. I'm going to change this to -EFAULT in the next
revision.

Thanks!
Fam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-13  9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
  2015-02-15  6:44   ` Fam Zheng
@ 2015-02-15 15:16   ` Michael Kerrisk (man-pages)
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-02-15 15:16 UTC (permalink / raw)
  To: Omar Sandoval, Fam Zheng
  Cc: mtk.manpages, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini

On 02/13/2015 10:53 AM, Omar Sandoval wrote:
> On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
>> Hi all,
>>
>> This is the updated series for the new epoll system calls, with the cover
>> letter rewritten which includes some more explanation. Comments are very
>> welcome!
>>
>> Original Motivation
>> ===================
>>
>> QEMU, and probably many more select/poll based applications, will consider
>> epoll as an alternative, when its event loop needs to handle a big number of
>> fds. However, there are currently two concerns with epoll which prevents the
>> switching from ppoll to epoll. 
>>
>> The major one is the timeout precision.
>>
>> For example in QEMU, the main loop takes care of calling callbacks at a
>> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
>> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
>> rounding up the timeout will hurt performance badly.
>>
>> The minor one is the number of system call to update fd set.
>>
>> While epoll can handle a large number of fds quickly, it still requires one
>> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
>> fd array. This may as well make epoll inferior to ppoll in the cases where a
>> small, but frequently changing set of fds are polled by the event loop.
>>
>> This series introduces two new epoll APIs to address them respectively. The
>> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
>> suggested clockid as a parameter in epoll_pwait1.
>>
>> Discussion
>> ==========
>>
>> [Note: This is the question part regarding the interface contract of
>> epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch yet,
>> please skip this part and probably start with the man page style documentation.
>> You can resume to this section later.]
>>
>> [Thanks to Omar Sandoval <osandov@osandov.com>, who pointed out this in
>> reviewing v1]
>>
>> We try to report status for each command in epoll_ctl_batch, by writting to
>> user provided command array (pointed to cmds). The tricky thing in the
>> implementation is that, copying the results back to userspace comes last, after
>> the commands are executed. At this point, if the copy_to_user fails, the
>> effects are done and no return - or if we add some mechanism to revert it, the
>> code will be too complicated and slow.
>>
>> In above corner case, the return value of epoll_ctl_batch is smaller than
>> ncmds, which assures our caller that last N commands failed, where N = ncmds -
>> ret.  But they'll also find that cmd.result is not changed to error code.
>>
>> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
>> user does know the actual number of commands that succeed.
>>
>> So, do we leave it as is? Or is there any way to improve?
>>
>> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
>> before we execute the commands. If it succeeds, it's even less likely the last
>> copy_to_user could fail, so that we can even probably assert it won't. The
>> testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
>> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
>> performance impact, especially when @ncmds is big.
>>
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
> 
> The two alternatives that I see are:
> 
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
> 
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
> 
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.

What Omar says makes sense to me too. Best to have the user get a clear 
error indication for this case.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (7 preceding siblings ...)
  2015-02-13  9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
@ 2015-02-15 22:00 ` Jonathan Corbet
  2015-02-16  1:02   ` Fam Zheng
  8 siblings, 1 reply; 17+ messages in thread
From: Jonathan Corbet @ 2015-02-15 22:00 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

On Fri, 13 Feb 2015 17:03:56 +0800
Fam Zheng <famz@redhat.com> wrote:

> SYNOPSIS
> 
>        #include <sys/epoll.h>
> 
>        int epoll_pwait1(int epfd, int flags,
>                         struct epoll_event *events,
>                         int maxevents,
>                         struct epoll_wait_params *params);

Quick, possibly dumb question: might it make sense to also pass in 
sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
another parameter in the future, the kernel can tell which version is in
use and they won't have to do an epoll_pwait2()?

Thanks,

jon

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-15 22:00 ` Jonathan Corbet
@ 2015-02-16  1:02   ` Fam Zheng
  2015-02-16  7:25     ` Seymour, Shane M
  2015-02-18 18:49     ` Ingo Molnar
  0 siblings, 2 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-16  1:02 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

On Sun, 02/15 15:00, Jonathan Corbet wrote:
> On Fri, 13 Feb 2015 17:03:56 +0800
> Fam Zheng <famz@redhat.com> wrote:
> 
> > SYNOPSIS
> > 
> >        #include <sys/epoll.h>
> > 
> >        int epoll_pwait1(int epfd, int flags,
> >                         struct epoll_event *events,
> >                         int maxevents,
> >                         struct epoll_wait_params *params);
> 
> Quick, possibly dumb question: might it make sense to also pass in 
> sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> another parameter in the future, the kernel can tell which version is in
> use and they won't have to do an epoll_pwait2()?
> 

Flags can be used for that, if the change is not radically different.

Fam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-16  1:02   ` Fam Zheng
@ 2015-02-16  7:25     ` Seymour, Shane M
  2015-02-16  8:12       ` Fam Zheng
  2015-02-18 18:49     ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Seymour, Shane M @ 2015-02-16  7:25 UTC (permalink / raw)
  To: Fam Zheng, Jonathan Corbet
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

I found the manual pages really confusing so I had a go at rewriting
them - there were places in the manual page that didn't match the
functionality provided by your code as well as I could tell).

My apologies for a few formatting issues though. I still don't like
parts of epoll_pwait1 but it's less confusing than it was.

You are free to take some or all or none of the changes.

I did have a question I marked with **** below about what you
describe and what your code does.

1) epoll_ctl_batch
------------------

NAME
       epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_ctl_batch(int epfd, int flags,
                           int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

       This system call is an extension of epoll_ctl(). The primary
       difference is that this system call allows you to batch multiple
       operations with the one system call. This provides a more efficient
       interface for updating events on this epoll file descriptor epfd.

       The flags argument is reserved and must be 0.

       The argument ncmds is the number of cmds entries being passed in.
       This number must be greater than 0.

       Each operation is specified as an element in the cmds array, defined as:

           struct epoll_ctl_cmd {

                  /* Reserved flags for future extension, must be 0. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code after this
                   * command is executed by kernel */
                  int result;
           };

       This system call is not atomic when updating the epoll descriptor.
       All entries in cmds are executed in the order provided. If any cmds
       entry fails to be processed no further entries are processed and 
       the number of successfully processed entries is returned.

       Each single operation defined by a struct epoll_ctl_cmd has the same 
       semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page
       for more information about how to correctly setup the members of a
       struct epoll_ctl_cmd.

       Upon completion of the call the result member of each struct epoll_ctl_cmd
       may be set to 0 (sucessfully completed) or an error code depending on the
       result of the command. If the kernel fails to change the result (for
       example the location of the cmds argument is fully or partly read only)
       the result member of each struct epoll_ctl_cmd may be unchanged. 

RETURN VALUE

       epoll_ctl_batch() returns a number greater than 0 to indicate the number
       of cmnd entries processed. If all entries have been processed this will
       equal the ncmds parameter passed in.

       If one or more parameters are incorrect the value returned is -1 with
       errno set appropriately - no cmds entries have been processed when this
       happens.

       If processing any entry in the cmds argument results in an error the
       number returned is the number of the failing entry - this number will be
       less than ncmds. Since ncmds must be greater than 0 a return value of
       0 indicates an error associated with the very first cmds entry. 
       A return value of 0 does not indicate a successful system call.

       To correctly test the return value from epoll_ctl_batch() use code similar
       to the following:

		ret=epoll_ctl_batch(epfd, flags, ncmds, &cmds);
		if (ret < ncmds) {
			if (ret == -1) {
				/* An argument was invalid */
			} else {
				/* ret contains the number of successful entries
                                 * processed. If you (mis?)use it as a C index it
                                 * will index directly to the failing entry to
                                 * get the result use cmds[ret].result which may 
                                 * contain the errno value associated with the
                                 * entry.
                                 */
			}
		} else {
			/* Success */
		}

ERRORS

       EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
              cmds is NULL.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       EFAULT The memory area pointed to by cmds is not accessible with read
              permissions.

       In the event that the return value is not the same as the ncmds parameter
       the result member of the failing struct epoll_ctl_cmd will contain a
       negative errno value related to the error. The errno values that can be set
       are those documented on the epoll_ctl(2) manual page.

CONFORMING TO

       epoll_ctl_batch() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)


2) epoll_pwait1
---------------

NAME
       epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_pwait1(int epfd, int flags,
                        struct epoll_event *events,
                        int maxevents,
                        struct epoll_wait_params *params);

DESCRIPTION

       The epoll_pwait1() syscall differs from epoll_pwait() only in
       parameter list. The epfd, events and maxevents parameters are the same
       as in epoll_wait() and epoll_pwait(). The flags and params are new.

       The flags is reserved and must be zero.

       The params is a pointer to a struct epoll_wait_params which is
       defined as:

           struct epoll_wait_params {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
               size_t sigsetsize;
           };

       The clockid member must be either CLOCK_REALTIME or CLOCK_MONOTONIC.
       This will choose the clock type to use for timeout. This differs to
       epoll_pwait(2) which has an implicit clock type of CLOCK_MONOTONIC.
       
       The timeout member specifies the minimum time that epoll_wait(2) will
       block. The time spent waiting will be rounded up to the clock
       granularity. Kernel scheduling delays mean that the blocking
       interval may overrun by a small amount. Specifying a -1 for either
       tv_sec or tv_nsec member of the struct timespec timeout will cause
       causes epoll_pwait1(2) to block indefinitely. Specifying a timeout
       equal to zero (both tv_sec or tv_nsec member of the struct timespec
       timeout are zero) causes epoll_wait(2) to return immediately, even
       if no events are available.

**** Are you really really sure about this for the -1 stuff? your code copies in the timespec and just passes it to timespec_to_ktime:

+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
...
+	kt = timespec_to_ktime(p.timeout);

Compare that to something like the futex syscall which does this:

		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
			return -EFAULT;
		if (!timespec_valid(&ts))
			return -EINVAL;

		t = timespec_to_ktime(ts);

If the timespec is not valid it returns -EINVAL back to user space. With your settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of the conversion that could break your code in the future if in the unlikely event someone changes timespec_to_ktime() and should it be:

+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
+       if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) {
+  /* this is off the top of my head no idea if it will compile */
+		p.timeout.tv_sec = KTIME_SEC_MAX;
+		p.timeout.tv_nsec = 0;
+	}
+       if (!timespec_valid(&p.timeout))
+       	return -EINVAL;
...
+	kt = timespec_to_ktime(p.timeout);

I could of course be worried about nothing here is what I've suggested the right thing to do? Anyone feel free to chime in.

       Both sigmask and sigsetsize have the same semantics as epoll_pwait(2). The
       sigmask field may be specified as NULL, in which case epoll_pwait1(2)
       will behave like epoll_wait(2).

   User visibility of sigsetsize

       In epoll_pwait(2) and other syscalls, sigsetsize is not visible to
       an application developer as glibc has a wrapper around epoll_pwait(2).
       Now we pack several parameters in epoll_wait_params. In
       order to hide sigsetsize from application code this system call also
       needs to be wrapped either by expanding parameters and building the
       structure in the wrapper function, or by only asking application to
       provide this part of the structure:

           struct epoll_wait_params_user {
               int clockid;
               struct timespec timeout;
               sigset_t *sigmask;
           };

      In the wrapper function it would be copied to a full structure and
      sigsetsize filled in.

RETURN VALUE

       When successful, epoll_wait1() returns the number of file descriptors
       ready for the requested I/O, or zero if no file descriptor became ready
       during the requested timeout nanoseconds. When an error occurs, 
       epoll_wait1() returns -1 and errno is set appropriately.

ERRORS

       This system call can set errno to the same values as epoll_pwait(2), 
       as well as the following additional reasons:

       EINVAL flags is not zero, or clockid is not one of CLOCK_REALTIME or
              CLOCK_MONOTONIC.

       EFAULT The memory area pointed to by params is not accessible.


CONFORMING TO

       epoll_pwait1() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-16  7:25     ` Seymour, Shane M
@ 2015-02-16  8:12       ` Fam Zheng
  0 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-16  8:12 UTC (permalink / raw)
  To: Seymour, Shane M
  Cc: Jonathan Corbet, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Hi Seymour,

On Mon, 02/16 07:25, Seymour, Shane M wrote:
> I found the manual pages really confusing so I had a go at rewriting
> them - there were places in the manual page that didn't match the
> functionality provided by your code as well as I could tell).

Could you point which places don't match the code?

> 
> My apologies for a few formatting issues though. I still don't like
> parts of epoll_pwait1 but it's less confusing than it was.

Any other than the timespec question don't you like?

> 
> You are free to take some or all or none of the changes.
> 
> I did have a question I marked with **** below about what you
> describe and what your code does.
> 

<snip>

>        The timeout member specifies the minimum time that epoll_wait(2) will
>        block. The time spent waiting will be rounded up to the clock
>        granularity. Kernel scheduling delays mean that the blocking
>        interval may overrun by a small amount. Specifying a -1 for either
>        tv_sec or tv_nsec member of the struct timespec timeout will cause
>        causes epoll_pwait1(2) to block indefinitely. Specifying a timeout
>        equal to zero (both tv_sec or tv_nsec member of the struct timespec
>        timeout are zero) causes epoll_wait(2) to return immediately, even
>        if no events are available.
> 
> **** Are you really really sure about this for the -1 stuff? your code copies
> in the timespec and just passes it to timespec_to_ktime:
> 
> +	if (copy_from_user(&p, params, sizeof(p)))
> +		return -EFAULT;
> ...
> +	kt = timespec_to_ktime(p.timeout);
> 
> Compare that to something like the futex syscall which does this:
> 
> 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
> 			return -EFAULT;
> 		if (!timespec_valid(&ts))
> 			return -EINVAL;
> 
> 		t = timespec_to_ktime(ts);
> 
> If the timespec is not valid it returns -EINVAL back to user space. With your
> settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of
> the conversion that could break your code in the future if in the unlikely
> event someone changes timespec_to_ktime() and should it be:
> 
> +	if (copy_from_user(&p, params, sizeof(p)))
> +		return -EFAULT;
> +       if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) {
> +  /* this is off the top of my head no idea if it will compile */
> +		p.timeout.tv_sec = KTIME_SEC_MAX;
> +		p.timeout.tv_nsec = 0;
> +	}
> +       if (!timespec_valid(&p.timeout))
> +       	return -EINVAL;
> ...
> +	kt = timespec_to_ktime(p.timeout);

OK. timespec_valid() is clear about this: negative tv_sec is invalid, so I
don't think accepting -1 from user is the right thing to do.

We cannot do pointer check as ppoll already because the structure is embedded
in epoll_wait_params.

Maybe it's best to use a flags bit (#define EPOLL_PWAIT1_BLOCK 1).  What do you
think?

Fam

<snip>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-16  1:02   ` Fam Zheng
  2015-02-16  7:25     ` Seymour, Shane M
@ 2015-02-18 18:49     ` Ingo Molnar
  2015-02-25  3:30       ` Fam Zheng
  1 sibling, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2015-02-18 18:49 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Jonathan Corbet, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval


* Fam Zheng <famz@redhat.com> wrote:

> On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > On Fri, 13 Feb 2015 17:03:56 +0800
> > Fam Zheng <famz@redhat.com> wrote:
> > 
> > > SYNOPSIS
> > > 
> > >        #include <sys/epoll.h>
> > > 
> > >        int epoll_pwait1(int epfd, int flags,
> > >                         struct epoll_event *events,
> > >                         int maxevents,
> > >                         struct epoll_wait_params *params);
> > 
> > Quick, possibly dumb question: might it make sense to also pass in 
> > sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> > another parameter in the future, the kernel can tell which version is in
> > use and they won't have to do an epoll_pwait2()?
> > 
> 
> Flags can be used for that, if the change is not 
> radically different.

Passing in size is generally better than flags, because 
that way an extension of the ABI (new field[s]) 
automatically signals towards the kernel what to do with 
old binaries - while extending the functionality of new 
binaries, without sacrificing functionality.

With flags you are either limited to the same structure 
size - or have to decode a 'size' value from the flags 
value - which is fragile (and in which case a real 'size' 
parameter is better).

in the perf ABI we use something like that: there's a 
perf_attr.size parameter that iterates the ABI forward, 
while still being binary compatible with older software.

If old binaries pass in a smaller structure to a newer 
kernel then the kernel pads the new fields with zero by 
default - that way the kernel internals are never burdened 
with compatibility details and data format versions.

If new user-space passes in a large structure than the 
kernel can handle then the kernel returns an error - this 
way user-space can transparently support conditional 
features and fallback logic.

It works really well, we've done literally a hundred perf 
ABI extensions this way in the last 4+ years, in a pretty 
natural fashion, without littering the kernel (or 
user-space) with version legacies and without breaking 
existing perf tooling.

Other syscall ABIs already get painful when trying to 
handle 2-3 data structure versions, so people either give 
up, or add flags kludges or go to new syscall entries: 
which is painful in its own fashion and adds unnecessary 
latency to feature introduction as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-18 18:49     ` Ingo Molnar
@ 2015-02-25  3:30       ` Fam Zheng
  0 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2015-02-25  3:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jonathan Corbet, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

On Wed, 02/18 19:49, Ingo Molnar wrote:
> 
> * Fam Zheng <famz@redhat.com> wrote:
> 
> > On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > > On Fri, 13 Feb 2015 17:03:56 +0800
> > > Fam Zheng <famz@redhat.com> wrote:
> > > 
> > > > SYNOPSIS
> > > > 
> > > >        #include <sys/epoll.h>
> > > > 
> > > >        int epoll_pwait1(int epfd, int flags,
> > > >                         struct epoll_event *events,
> > > >                         int maxevents,
> > > >                         struct epoll_wait_params *params);
> > > 
> > > Quick, possibly dumb question: might it make sense to also pass in 
> > > sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> > > another parameter in the future, the kernel can tell which version is in
> > > use and they won't have to do an epoll_pwait2()?
> > > 
> > 
> > Flags can be used for that, if the change is not 
> > radically different.
> 
> Passing in size is generally better than flags, because 
> that way an extension of the ABI (new field[s]) 
> automatically signals towards the kernel what to do with 
> old binaries - while extending the functionality of new 
> binaries, without sacrificing functionality.
> 
> With flags you are either limited to the same structure 
> size - or have to decode a 'size' value from the flags 
> value - which is fragile (and in which case a real 'size' 
> parameter is better).
> 
> in the perf ABI we use something like that: there's a 
> perf_attr.size parameter that iterates the ABI forward, 
> while still being binary compatible with older software.
> 
> If old binaries pass in a smaller structure to a newer 
> kernel then the kernel pads the new fields with zero by 
> default - that way the kernel internals are never burdened 
> with compatibility details and data format versions.
> 
> If new user-space passes in a large structure than the 
> kernel can handle then the kernel returns an error - this 
> way user-space can transparently support conditional 
> features and fallback logic.
> 
> It works really well, we've done literally a hundred perf 
> ABI extensions this way in the last 4+ years, in a pretty 
> natural fashion, without littering the kernel (or 
> user-space) with version legacies and without breaking 
> existing perf tooling.
> 
> Other syscall ABIs already get painful when trying to 
> handle 2-3 data structure versions, so people either give 
> up, or add flags kludges or go to new syscall entries: 
> which is painful in its own fashion and adds unnecessary 
> latency to feature introduction as well.
> 

Excellent. This now makes a lot of sense to me, thanks to your explanations,
Ingo.

I'll add the "size" field in the next revision.

Thanks,
Fam

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-02-25  3:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-13  9:03 [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-02-13  9:03 ` [PATCH RFC v3 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-02-13  9:03 ` [PATCH RFC v3 2/7] epoll: Specify clockid explicitly Fam Zheng
2015-02-13  9:03 ` [PATCH RFC v3 3/7] epoll: Extract ep_ctl_do Fam Zheng
2015-02-13  9:04 ` [PATCH RFC v3 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
2015-02-13  9:04 ` [PATCH RFC v3 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
2015-02-13  9:04 ` [PATCH RFC v3 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
2015-02-13  9:04 ` [PATCH RFC v3 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
2015-02-13  9:53 ` [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Omar Sandoval
2015-02-15  6:44   ` Fam Zheng
2015-02-15 15:16   ` Michael Kerrisk (man-pages)
2015-02-15 22:00 ` Jonathan Corbet
2015-02-16  1:02   ` Fam Zheng
2015-02-16  7:25     ` Seymour, Shane M
2015-02-16  8:12       ` Fam Zheng
2015-02-18 18:49     ` Ingo Molnar
2015-02-25  3:30       ` Fam Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).