linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
@ 2015-02-04 10:36 Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Changes from v1 (https://lkml.org/lkml/2015/1/20/189):

  - As discussed in previous thread [1], split the call to epoll_ctl_batch and
    epoll_pwait. [Michael]

  - Fix memory leaks. [Omar]

  - Add a short comment about the ignored copy_to_user failure. [Omar]

  - Cover letter rewritten.

This adds two new system calls as described below. I mentioned glibc wrapping
of sigarg in epoll_pwait1 description, so don't read it as end user man page
yet.

One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
which returns per command error code. Ideas to improve that are welcome!

1) epoll_ctl_batch
------------------

NAME
       epoll_ctl_batch - modify an epoll descriptor in batch

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_ctl_batch(int epfd, int flags,
                           int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

       The system call performs a batch of epoll_ctl operations. It allows
       efficient update of events on this epoll descriptor.

       Flags is reserved and must be 0.

       Each operation is specified as an element in the cmds array, defined as:

           struct epoll_ctl_cmd {

                  /* Reserved flags for future extension, must be 0. */
                  int flags;

                  /* The same as epoll_ctl() op parameter. */
                  int op;

                  /* The same as epoll_ctl() fd parameter. */
                  int fd;

                  /* The same as the "events" field in struct epoll_event. */
                  uint32_t events;

                  /* The same as the "data" field in struct epoll_event. */
                  uint64_t data;

                  /* Output field, will be set to the return code after this
                   * command is executed by kernel */
                  int error_hint;
           };

       Commands are executed in their order in cmds, and if one of them failed,
       the rest after it are not tried.

       Not that this call isn't atomic in terms of updating the epoll
       descriptor, which means a second epoll_ctl or epoll_ctl_batch call
       during the first epoll_ctl_batch may make the operation sequence
       interleaved. However, each single epoll_ctl_cmd operation has the same
       semantics as a epoll_ctl call.

RETURN VALUE

       If one or more of the parameters are incorrect, -1 is returned and errno
       is set appropriately. Otherwise, the number of succeeded commands is
       returned.

       Each error_hint field may be set to the error code or 0, depending on
       the result of the command. If there is some error in returning the error
       to user, it may also be unchanged, even though the command may actually
       be executed. In this case, it's still ensured that the i-th (i = ret)
       command is the failed command.

ERRORS

       Errors for the overall system call (in errno) are:

       EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
              cmds is NULL.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       EFAULT The memory area pointed to by cmds is not accessible with write
              permissions.


       Errors for each command (in error_hint) are:

       EBADF  epfd or fd is not a valid file descriptor.

       EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
              already registered with this epoll instance.

       EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
              or the requested operation op is not supported by this interface.

       ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
              with this epoll instance.

       ENOMEM There was insufficient memory to handle the requested op control
              operation.

       ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
              encountered while trying to register (EPOLL_CTL_ADD) a new file
              descriptor on an epoll instance.  See epoll(7) for further
              details.

       EPERM  The target file fd does not support epoll.

CONFORMING TO

       epoll_ctl_batch() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)


2) epoll_pwait1
---------------

NAME
       epoll_pwait1 - wait for an I/O event on an epoll file descriptor

SYNOPSIS

       #include <sys/epoll.h>

       int epoll_pwait1(int epfd, int flags,
                        struct epoll_event *events,
                        int maxevents,
                        struct timespec *timeout,
                        struct sigargs *sig);

DESCRIPTION

       The epoll_pwait1 system call differs from epoll_pwait only in parameter
       types. The first difference is timeout, a pointer to timespec structure
       which allows nanosecond presicion; the second difference, which should
       probably be wrapper by glibc and only expose a sigset_t pointer as in
       pselect6.

       If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
       (return immediately). Otherwise it's converted to nanosecond scalar,
       again, with the same convention as epoll_pwait's timeout.

RETURN VALUE

       The same as said in epoll_pwait(2).

ERRORS

       The same as said in man epoll_pwait(2), plus:

       EINVAL flags is not zero.


CONFORMING TO

       epoll_pwait1() is Linux-specific.

SEE ALSO

       epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

Fam Zheng (7):
  epoll: Extract epoll_wait_do and epoll_pwait_do
  epoll: Specify clockid explicitly
  epoll: Extract ep_ctl_do
  epoll: Add implementation for epoll_ctl_batch
  x86: Hook up epoll_ctl_batch syscall
  epoll: Add implementation for epoll_pwait1
  x86: Hook up epoll_pwait1 syscall

 arch/x86/syscalls/syscall_32.tbl |   2 +
 arch/x86/syscalls/syscall_64.tbl |   2 +
 fs/eventpoll.c                   | 241 ++++++++++++++++++++++++++-------------
 include/linux/syscalls.h         |   9 ++
 include/uapi/linux/eventpoll.h   |  11 ++
 5 files changed, 186 insertions(+), 79 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 2/7] epoll: Specify clockid explicitly Fam Zheng
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

In preparation of new epoll syscalls, this patch allows reusing the code from
epoll_pwait implementation. The new functions uses ktime_t for more accuracy.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 130 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 59 insertions(+), 71 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..4cf359d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
 }
 
-static inline struct timespec ep_set_mstimeout(long ms)
-{
-	struct timespec now, ts = {
-		.tv_sec = ms / MSEC_PER_SEC,
-		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-	};
-
-	ktime_get_ts(&now);
-	return timespec_add_safe(now, ts);
-}
-
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller supplied
  *           event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
  * @events: Pointer to the userspace buffer where the ready events should be
  *          stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- *           milliseconds. If the @timeout is zero, the function will not block,
- *           while if the @timeout is less than zero, the function will block
- *           until at least one event has been retrieved (or an error
- *           occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation.  If 0, the
+ *           function will not block. If negative, the function will block until
+ *           at least one event has been retrieved (or an error occurred).
  *
  * Returns: Returns the number of ready events which have been fetched, or an
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
 
-	if (timeout > 0) {
-		struct timespec end_time = ep_set_mstimeout(timeout);
-
-		slack = select_estimate_accuracy(&end_time);
-		to = &expires;
-		*to = timespec_to_ktime(end_time);
-	} else if (timeout == 0) {
+	if (!ktime_to_ns(timeout)) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
 		 * caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		timed_out = 1;
 		spin_lock_irqsave(&ep->lock, flags);
 		goto check_events;
+	} else if (ktime_to_ns(timeout) > 0) {
+		struct timespec now, end_time;
+
+		ktime_get_ts(&now);
+		end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+		slack = select_estimate_accuracy(&end_time);
+		to = &expires;
+		*to = timespec_to_ktime(end_time);
 	}
 
 fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
 	return error;
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+				int maxevents, const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -2002,29 +1988,32 @@ error_fput:
 
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_pwait(2).
+ * part of the user space epoll_wait(2).
  */
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
-		size_t, sigsetsize)
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+				 int maxevents, ktime_t timeout,
+				 sigset_t *sigmask, size_t sigsetsize)
 {
 	int error;
-	sigset_t ksigmask, sigsaved;
+	sigset_t sigsaved;
 
 	/*
 	 * If the caller wants a certain signal mask to be set during the wait,
 	 * we apply it here.
 	 */
 	if (sigmask) {
-		if (sigsetsize != sizeof(sigset_t))
-			return -EINVAL;
-		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
-			return -EFAULT;
 		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
+		set_current_blocked(sigmask);
 	}
 
-	error = sys_epoll_wait(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2044,49 +2033,48 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 	return error;
 }
 
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_pwait(2).
+ */
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+		size_t, sigsetsize)
+{
+	ktime_t kt = ms_to_ktime(timeout);
+	sigset_t ksigmask;
+
+	if (sigmask) {
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-			struct epoll_event __user *, events,
-			int, maxevents, int, timeout,
-			const compat_sigset_t __user *, sigmask,
-			compat_size_t, sigsetsize)
+		       struct epoll_event __user *, events,
+		       int, maxevents, int, timeout,
+		       const compat_sigset_t __user *, sigmask,
+		       compat_size_t, sigsetsize)
 {
-	long err;
 	compat_sigset_t csigmask;
-	sigset_t ksigmask, sigsaved;
+	sigset_t ksigmask;
+	ktime_t kt = ms_to_ktime(timeout);
 
-	/*
-	 * If the caller wants a certain signal mask to be set during the wait,
-	 * we apply it here.
-	 */
 	if (sigmask) {
 		if (sigsetsize != sizeof(compat_sigset_t))
 			return -EINVAL;
 		if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
 			return -EFAULT;
 		sigset_from_compat(&ksigmask, &csigmask);
-		sigsaved = current->blocked;
-		set_current_blocked(&ksigmask);
-	}
-
-	err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
-	/*
-	 * If we changed the signal mask, we need to restore the original one.
-	 * In case we've got a signal while waiting, we do not restore the
-	 * signal mask yet, and we allow do_signal() to deliver the signal on
-	 * the way back to userspace, before the signal mask is restored.
-	 */
-	if (sigmask) {
-		if (err == -EINTR) {
-			memcpy(&current->saved_sigmask, &sigsaved,
-			       sizeof(sigsaved));
-			set_restore_sigmask();
-		} else
-			set_current_blocked(&sigsaved);
 	}
 
-	return err;
+	return epoll_pwait_do(epfd, events, maxevents, kt,
+			      sigmask ? &ksigmask : NULL, sigsetsize);
 }
 #endif
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 2/7] epoll: Specify clockid explicitly
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 3/7] epoll: Extract ep_ctl_do Fam Zheng
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Later we will add clockid in the interface, so let's start using explicit
clockid internally. Now we specify CLOCK_MONOTONIC, which is the same as before.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4cf359d..01f469a 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
  *          error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, const ktime_t timeout)
+		   int maxevents, int clockid, const ktime_t timeout)
 {
 	int res = 0, eavail, timed_out = 0;
 	unsigned long flags;
@@ -1624,7 +1624,8 @@ fetch_events:
 			}
 
 			spin_unlock_irqrestore(&ep->lock, flags);
-			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+			if (!schedule_hrtimeout_range_clock(to, slack,
+						HRTIMER_MODE_ABS, clockid))
 				timed_out = 1;
 
 			spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1946,8 @@ error_return:
 }
 
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
-				int maxevents, const ktime_t timeout)
+				int maxevents, int clockid,
+				const ktime_t timeout)
 {
 	int error;
 	struct fd f;
@@ -1979,7 +1981,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 	ep = f.file->private_data;
 
 	/* Time to fish for events ... */
-	error = ep_poll(ep, events, maxevents, timeout);
+	error = ep_poll(ep, events, maxevents, clockid, timeout);
 
 error_fput:
 	fdput(f);
@@ -1994,12 +1996,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 		int, maxevents, int, timeout)
 {
 	ktime_t kt = ms_to_ktime(timeout);
-	return epoll_wait_do(epfd, events, maxevents, kt);
+	return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
 }
 
 static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
-				 int maxevents, ktime_t timeout,
-				 sigset_t *sigmask, size_t sigsetsize)
+				 int maxevents,
+				 int clockid, ktime_t timeout,
+				 sigset_t *sigmask)
 {
 	int error;
 	sigset_t sigsaved;
@@ -2013,7 +2016,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
 		set_current_blocked(sigmask);
 	}
 
-	error = epoll_wait_do(epfd, events, maxevents, timeout);
+	error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
 
 	/*
 	 * If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2053,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
 			return -EFAULT;
 	}
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 
 #ifdef CONFIG_COMPAT
@@ -2073,8 +2076,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		sigset_from_compat(&ksigmask, &csigmask);
 	}
 
-	return epoll_pwait_do(epfd, events, maxevents, kt,
-			      sigmask ? &ksigmask : NULL, sigsetsize);
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+			      sigmask ? &ksigmask : NULL);
 }
 #endif
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 3/7] epoll: Extract ep_ctl_do
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 2/7] epoll: Specify clockid explicitly Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This is a common part from epoll_ctl implementation which will be shared with
the coming epoll_ctl_wait.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 01f469a..a1c313c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1808,22 +1808,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
 {
 	int error;
 	int full_check = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
 	struct eventpoll *tep = NULL;
 
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
 	error = -EBADF;
 	f = fdget(epfd);
 	if (!f.file)
@@ -1945,6 +1938,23 @@ error_return:
 	return error;
 }
 
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	struct epoll_event epds;
+
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	return ep_ctl_do(epfd, op, fd, epds);
+}
+
 static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
 				int maxevents, int clockid,
 				const ktime_t timeout)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 4/7] epoll: Add implementation for epoll_ctl_batch
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (2 preceding siblings ...)
  2015-02-04 10:36 ` [PATCH RFC v2 3/7] epoll: Extract ep_ctl_do Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h       |  4 ++++
 include/uapi/linux/eventpoll.h | 11 ++++++++++
 3 files changed, 63 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a1c313c..4eb40e0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2067,6 +2067,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 			      sigmask ? &ksigmask : NULL);
 }
 
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+		int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+	struct epoll_ctl_cmd *kcmds = NULL;
+	int i, r, ret = 0;
+	int cmd_size;
+
+	if (flags)
+		return -EINVAL;
+	if (ncmds <= 0 || !cmds)
+		return -EINVAL;
+	cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+	kcmds = kmalloc(cmd_size, GFP_KERNEL);
+	if (!kcmds)
+		return -ENOMEM;
+	if (copy_from_user(kcmds, cmds, cmd_size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	for (i = 0; i < ncmds; i++) {
+		struct epoll_event ev = (struct epoll_event) {
+			.events = kcmds[i].events,
+			.data = kcmds[i].data,
+		};
+		if (kcmds[i].flags) {
+			kcmds[i].error_hint = -EINVAL;
+			goto copy;
+		}
+		kcmds[i].error_hint = ep_ctl_do(epfd, kcmds[i].op,
+						kcmds[i].fd, ev);
+		if (kcmds[i].error_hint)
+			goto copy;
+		ret++;
+	}
+copy:
+	r = copy_to_user(cmds, kcmds,
+			 sizeof(struct epoll_ctl_cmd) * ncmds);
+	/* Failing to copy the command results back will leave
+	 * userspace no way to know the actual error code, but we still
+	 * report the number of succeeded commands with ret, so it's
+	 * not a big problem. Ignore it for now.
+	 */
+	(void) r;
+out:
+	kfree(kcmds);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..117822c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
 #define _LINUX_SYSCALLS_H
 
 struct epoll_event;
+struct epoll_ctl_cmd;
 struct iattr;
 struct inode;
 struct iocb;
@@ -630,6 +631,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+				    int ncmds,
+				    struct epoll_ctl_cmd __user *cmds);
 asmlinkage long sys_gethostname(char __user *name, int len);
 asmlinkage long sys_sethostname(char __user *name, int len);
 asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..bbdce3d 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
 #include <linux/fcntl.h>
 #include <linux/types.h>
 
+#include <linux/signal.h>
+
 /* Flags for epoll_create1.  */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
@@ -61,6 +63,15 @@ struct epoll_event {
 	__u64 data;
 } EPOLL_PACKED;
 
+struct epoll_ctl_cmd {
+	int flags;
+	int op;
+	int fd;
+	__u32 events;
+	__u64 data;
+	int error_hint;
+} EPOLL_PACKED;
+
 #ifdef CONFIG_PM_SLEEP
 static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 5/7] x86: Hook up epoll_ctl_batch syscall
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (3 preceding siblings ...)
  2015-02-04 10:36 ` [PATCH RFC v2 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..fe809f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	epoll_ctl_batch		sys_epoll_ctl_batch
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..67b2ea4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	epoll_ctl_batch		sys_epoll_ctl_batch
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 6/7] epoll: Add implementation for epoll_pwait1
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (4 preceding siblings ...)
  2015-02-04 10:36 ` [PATCH RFC v2 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:36 ` [PATCH RFC v2 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

This is the new implementation for poll which uses timespec as timeout,
and has a flags parameter.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 fs/eventpoll.c           | 34 ++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h |  5 +++++
 2 files changed, 39 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4eb40e0..59b2db9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2115,6 +2115,40 @@ out:
 	return ret;
 }
 
+SYSCALL_DEFINE6(epoll_pwait1, int, epfd, int, flags,
+		struct epoll_event __user *, events,
+		int, maxevents, struct timespec __user *, timeout,
+		void __user *, sig)
+{
+	size_t sigsetsize = 0;
+	sigset_t ksigmask;
+	sigset_t __user *sigmask = NULL;
+	struct timespec kts;
+	ktime_t kt = { 0 };
+
+	if (flags)
+		return -EINVAL;
+	if (sig) {
+		if (!access_ok(VERIFY_READ, sig, sizeof(void *)+sizeof(size_t))
+		    || __get_user(sigmask, (sigset_t __user * __user *)sig)
+		    || __get_user(sigsetsize,
+				  (size_t __user *)(sig+sizeof(void *))))
+			return -EFAULT;
+		if (sigsetsize != sizeof(sigset_t))
+			return -EINVAL;
+		if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
+			return -EFAULT;
+	}
+	if (timeout) {
+		if (copy_from_user(&kts, timeout, sizeof(kts)))
+			return -EFAULT;
+		kt = timespec_to_ktime(kts);
+	}
+
+	return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC,
+			      kt, &ksigmask);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 		       struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 117822c..0c3ecdb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -631,6 +631,11 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
 				int maxevents, int timeout,
 				const sigset_t __user *sigmask,
 				size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, int flags,
+				 struct epoll_event __user *events,
+				 int maxevents,
+				 struct timespec __user *timeout,
+				 void __user *sig);
 asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
 				    int ncmds,
 				    struct epoll_ctl_cmd __user *cmds);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 7/7] x86: Hook up epoll_pwait1 syscall
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (5 preceding siblings ...)
  2015-02-04 10:36 ` [PATCH RFC v2 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
@ 2015-02-04 10:36 ` Fam Zheng
  2015-02-04 10:50 ` [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
	linux-api, Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index fe809f6..bf912d8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,3 +366,4 @@
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
 359	i386	epoll_ctl_batch		sys_epoll_ctl_batch
+360	i386	epoll_pwait1		sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 67b2ea4..9246ad5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
 323	64	epoll_ctl_batch		sys_epoll_ctl_batch
+324	64	epoll_pwait1		sys_epoll_pwait1
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (6 preceding siblings ...)
  2015-02-04 10:36 ` [PATCH RFC v2 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
@ 2015-02-04 10:50 ` Fam Zheng
  2015-02-04 12:44 ` Michael Kerrisk (man-pages)
  2015-02-04 21:38 ` Andy Lutomirski
  9 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-04 10:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

On Wed, 02/04 18:36, Fam Zheng wrote:
> 
>        Not that this call isn't atomic in terms of updating the epoll

s/Not/Note/

>        descriptor, which means a second epoll_ctl or epoll_ctl_batch call
>        during the first epoll_ctl_batch may make the operation sequence
>        interleaved. However, each single epoll_ctl_cmd operation has the same
>        semantics as a epoll_ctl call.
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (7 preceding siblings ...)
  2015-02-04 10:50 ` [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-02-04 12:44 ` Michael Kerrisk (man-pages)
  2015-02-05  1:52   ` Fam Zheng
  2015-02-04 21:38 ` Andy Lutomirski
  9 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-02-04 12:44 UTC (permalink / raw)
  To: Fam Zheng, linux-kernel
  Cc: mtk.manpages, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini, Omar Sandoval

Hello Fam Zheng,

On 02/04/2015 11:36 AM, Fam Zheng wrote:
> Changes from v1 (https://lkml.org/lkml/2015/1/20/189):
> 
>   - As discussed in previous thread [1], split the call to epoll_ctl_batch and
>     epoll_pwait. [Michael]
> 
>   - Fix memory leaks. [Omar]
> 
>   - Add a short comment about the ignored copy_to_user failure. [Omar]
> 
>   - Cover letter rewritten.
> 
> This adds two new system calls as described below. I mentioned glibc wrapping
> of sigarg in epoll_pwait1 description, so don't read it as end user man page
> yet.

Fair enough. But I think it would be helpful to say a little more already.
See my comment below.

> One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
> which returns per command error code. Ideas to improve that are welcome!
>
> 1) epoll_ctl_batch
> ------------------
> 
> NAME
>        epoll_ctl_batch - modify an epoll descriptor in batch
> 
> SYNOPSIS
> 
>        #include <sys/epoll.h>
> 
>        int epoll_ctl_batch(int epfd, int flags,
>                            int ncmds, struct epoll_ctl_cmd *cmds);
> 
> DESCRIPTION
> 
>        The system call performs a batch of epoll_ctl operations. It allows
>        efficient update of events on this epoll descriptor.
> 
>        Flags is reserved and must be 0.
> 
>        Each operation is specified as an element in the cmds array, defined as:
> 
>            struct epoll_ctl_cmd {
> 
>                   /* Reserved flags for future extension, must be 0. */
>                   int flags;
> 
>                   /* The same as epoll_ctl() op parameter. */
>                   int op;
> 
>                   /* The same as epoll_ctl() fd parameter. */
>                   int fd;
> 
>                   /* The same as the "events" field in struct epoll_event. */
>                   uint32_t events;
> 
>                   /* The same as the "data" field in struct epoll_event. */
>                   uint64_t data;
> 
>                   /* Output field, will be set to the return code after this
>                    * command is executed by kernel */
>                   int error_hint;

Why 'error_hint', rather than just stay 'status' or 'result'? I mean
is it really a "hint"? Also, it can be 0 meaning "success" (no error).

>            };
> 
>        Commands are executed in their order in cmds, and if one of them failed,
>        the rest after it are not tried.
> 
>        Not that this call isn't atomic in terms of updating the epoll
>        descriptor, which means a second epoll_ctl or epoll_ctl_batch call
>        during the first epoll_ctl_batch may make the operation sequence
>        interleaved. However, each single epoll_ctl_cmd operation has the same
>        semantics as a epoll_ctl call.

(Good to include that warning!)

> RETURN VALUE
> 
>        If one or more of the parameters are incorrect, -1 is returned and errno
>        is set appropriately. Otherwise, the number of succeeded commands is
>        returned.
> 
>        Each error_hint field may be set to the error code or 0, depending on
>        the result of the command. If there is some error in returning the error
>        to user, it may also be unchanged, even though the command may actually
>        be executed. In this case, it's still ensured that the i-th (i = ret)
>        command is the failed command.

Sorry -- I'm not clear here. Can you say some more here? What sort
of error might there be when "returning the error to the user"?
 
> ERRORS
> 
>        Errors for the overall system call (in errno) are:
> 
>        EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
>               cmds is NULL.
> 
>        ENOMEM There was insufficient memory to handle the requested op control
>               operation.
> 
>        EFAULT The memory area pointed to by cmds is not accessible with write
>               permissions.
> 
> 
>        Errors for each command (in error_hint) are:
> 
>        EBADF  epfd or fd is not a valid file descriptor.
> 
>        EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is
>               already registered with this epoll instance.
> 
>        EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd,
>               or the requested operation op is not supported by this interface.
> 
>        ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered
>               with this epoll instance.
> 
>        ENOMEM There was insufficient memory to handle the requested op control
>               operation.
> 
>        ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches was
>               encountered while trying to register (EPOLL_CTL_ADD) a new file
>               descriptor on an epoll instance.  See epoll(7) for further
>               details.
> 
>        EPERM  The target file fd does not support epoll.
> 
> CONFORMING TO
> 
>        epoll_ctl_batch() is Linux-specific.
> 
> SEE ALSO
> 
>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
> 
> 
> 2) epoll_pwait1
> ---------------
> 
> NAME
>        epoll_pwait1 - wait for an I/O event on an epoll file descriptor
> 
> SYNOPSIS
> 
>        #include <sys/epoll.h>
> 
>        int epoll_pwait1(int epfd, int flags,
>                         struct epoll_event *events,
>                         int maxevents,
>                         struct timespec *timeout,
>                         struct sigargs *sig);
> 
> DESCRIPTION
> 
>        The epoll_pwait1 system call differs from epoll_pwait only in parameter
>        types. The first difference is timeout, a pointer to timespec structure
>        which allows nanosecond presicion; the second difference, which should
>        probably be wrapper by glibc and only expose a sigset_t pointer as in
>        pselect6.

Here I think it still helps to explain that 'struct sigags' is a sigset_t* +
the size of the pointed-to set.

>        If timeout is NULL, it's treated as if 0 is specified in epoll_pwait

The convention I would expect here is that NULL means infinite timeout,
like select(), and timeout == {0,0} would get the "return immediately"
behavior. Why did you choose your convention? And, how does one otherwise 
request an infinite timeout?

>        (return immediately). Otherwise it's converted to nanosecond scalar,
>        again, with the same convention as epoll_pwait's timeout.
> 
> RETURN VALUE
> 
>        The same as said in epoll_pwait(2).
> 
> ERRORS
> 
>        The same as said in man epoll_pwait(2), plus:
> 
>        EINVAL flags is not zero.
> 
> 
> CONFORMING TO
> 
>        epoll_pwait1() is Linux-specific.
> 
> SEE ALSO
> 
>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)

In your earlier patch set, there was the ability to select the clock
used for timeouts. Why did this go away? I'm not sure whether we need
that functionality or not, but it would be good to know why you
dropped it this time.

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
                   ` (8 preceding siblings ...)
  2015-02-04 12:44 ` Michael Kerrisk (man-pages)
@ 2015-02-04 21:38 ` Andy Lutomirski
  2015-02-05  1:51   ` Fam Zheng
  9 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2015-02-04 21:38 UTC (permalink / raw)
  To: Fam Zheng
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett,
	Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

On Wed, Feb 4, 2015 at 2:36 AM, Fam Zheng <famz@redhat.com> wrote:
> 2) epoll_pwait1
> ---------------
>
> NAME
>        epoll_pwait1 - wait for an I/O event on an epoll file descriptor
>
> SYNOPSIS
>
>        #include <sys/epoll.h>
>
>        int epoll_pwait1(int epfd, int flags,
>                         struct epoll_event *events,
>                         int maxevents,
>                         struct timespec *timeout,
>                         struct sigargs *sig);
>
> DESCRIPTION
>
>        The epoll_pwait1 system call differs from epoll_pwait only in parameter
>        types. The first difference is timeout, a pointer to timespec structure
>        which allows nanosecond presicion; the second difference, which should
>        probably be wrapper by glibc and only expose a sigset_t pointer as in
>        pselect6.
>
>        If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
>        (return immediately). Otherwise it's converted to nanosecond scalar,
>        again, with the same convention as epoll_pwait's timeout.

Is the timeout absolute or relative?

I'd kind of like the ability to set timeouts on multiple clocks at the
same time, but I can live without that.

--Andy

>
> RETURN VALUE
>
>        The same as said in epoll_pwait(2).
>
> ERRORS
>
>        The same as said in man epoll_pwait(2), plus:
>
>        EINVAL flags is not zero.
>
>
> CONFORMING TO
>
>        epoll_pwait1() is Linux-specific.
>
> SEE ALSO
>
>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
>
> Fam Zheng (7):
>   epoll: Extract epoll_wait_do and epoll_pwait_do
>   epoll: Specify clockid explicitly
>   epoll: Extract ep_ctl_do
>   epoll: Add implementation for epoll_ctl_batch
>   x86: Hook up epoll_ctl_batch syscall
>   epoll: Add implementation for epoll_pwait1
>   x86: Hook up epoll_pwait1 syscall
>
>  arch/x86/syscalls/syscall_32.tbl |   2 +
>  arch/x86/syscalls/syscall_64.tbl |   2 +
>  fs/eventpoll.c                   | 241 ++++++++++++++++++++++++++-------------
>  include/linux/syscalls.h         |   9 ++
>  include/uapi/linux/eventpoll.h   |  11 ++
>  5 files changed, 186 insertions(+), 79 deletions(-)
>
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-04 21:38 ` Andy Lutomirski
@ 2015-02-05  1:51   ` Fam Zheng
  0 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-05  1:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Alexander Viro, Andrew Morton, Kees Cook, David Herrmann,
	Alexei Starovoitov, Miklos Szeredi, David Drysdale,
	Oleg Nesterov, David S. Miller, Vivek Goyal, Mike Frysinger,
	Theodore Ts'o, Heiko Carstens, Rasmus Villemoes,
	Rashika Kheria, Hugh Dickins, Mathieu Desnoyers, Peter Zijlstra,
	Linux FS Devel, Linux API, Josh Triplett,
	Michael Kerrisk (man-pages),
	Paolo Bonzini, Omar Sandoval

On Wed, 02/04 13:38, Andy Lutomirski wrote:
> On Wed, Feb 4, 2015 at 2:36 AM, Fam Zheng <famz@redhat.com> wrote:
> > 2) epoll_pwait1
> > ---------------
> >
> > NAME
> >        epoll_pwait1 - wait for an I/O event on an epoll file descriptor
> >
> > SYNOPSIS
> >
> >        #include <sys/epoll.h>
> >
> >        int epoll_pwait1(int epfd, int flags,
> >                         struct epoll_event *events,
> >                         int maxevents,
> >                         struct timespec *timeout,
> >                         struct sigargs *sig);
> >
> > DESCRIPTION
> >
> >        The epoll_pwait1 system call differs from epoll_pwait only in parameter
> >        types. The first difference is timeout, a pointer to timespec structure
> >        which allows nanosecond presicion; the second difference, which should
> >        probably be wrapper by glibc and only expose a sigset_t pointer as in
> >        pselect6.
> >
> >        If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
> >        (return immediately). Otherwise it's converted to nanosecond scalar,
> >        again, with the same convention as epoll_pwait's timeout.
> 
> Is the timeout absolute or relative?

Relative. Will document it. We can add a first flag for absolute timeout later.

Thanks.

Fam
> 
> I'd kind of like the ability to set timeouts on multiple clocks at the
> same time, but I can live without that.

Please see my reply to Michael.

Fam

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-04 12:44 ` Michael Kerrisk (man-pages)
@ 2015-02-05  1:52   ` Fam Zheng
  2015-02-05  7:44     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 15+ messages in thread
From: Fam Zheng @ 2015-02-05  1:52 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini, Omar Sandoval

On Wed, 02/04 13:44, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 02/04/2015 11:36 AM, Fam Zheng wrote:
> > Changes from v1 (https://lkml.org/lkml/2015/1/20/189):
> > 
> >   - As discussed in previous thread [1], split the call to epoll_ctl_batch and
> >     epoll_pwait. [Michael]
> > 
> >   - Fix memory leaks. [Omar]
> > 
> >   - Add a short comment about the ignored copy_to_user failure. [Omar]
> > 
> >   - Cover letter rewritten.
> > 
> > This adds two new system calls as described below. I mentioned glibc wrapping
> > of sigarg in epoll_pwait1 description, so don't read it as end user man page
> > yet.
> 
> Fair enough. But I think it would be helpful to say a little more already.
> See my comment below.
> 
> > One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
> > which returns per command error code. Ideas to improve that are welcome!
> >
> > 1) epoll_ctl_batch
> > ------------------
> > 
> > NAME
> >        epoll_ctl_batch - modify an epoll descriptor in batch
> > 
> > SYNOPSIS
> > 
> >        #include <sys/epoll.h>
> > 
> >        int epoll_ctl_batch(int epfd, int flags,
> >                            int ncmds, struct epoll_ctl_cmd *cmds);
> > 
> > DESCRIPTION
> > 
> >        The system call performs a batch of epoll_ctl operations. It allows
> >        efficient update of events on this epoll descriptor.
> > 
> >        Flags is reserved and must be 0.
> > 
> >        Each operation is specified as an element in the cmds array, defined as:
> > 
> >            struct epoll_ctl_cmd {
> > 
> >                   /* Reserved flags for future extension, must be 0. */
> >                   int flags;
> > 
> >                   /* The same as epoll_ctl() op parameter. */
> >                   int op;
> > 
> >                   /* The same as epoll_ctl() fd parameter. */
> >                   int fd;
> > 
> >                   /* The same as the "events" field in struct epoll_event. */
> >                   uint32_t events;
> > 
> >                   /* The same as the "data" field in struct epoll_event. */
> >                   uint64_t data;
> > 
> >                   /* Output field, will be set to the return code after this
> >                    * command is executed by kernel */
> >                   int error_hint;
> 
> Why 'error_hint', rather than just stay 'status' or 'result'? I mean
> is it really a "hint"? Also, it can be 0 meaning "success" (no error).

Because the convention is weak here: the copy_to_user, to assign values to this
field in user provided array, could (very unlikely) fail, at which point the
commands are already executed so there is no return. This corner case
inconsistency makes it hard to be a contract.

> 
> >            };
> > 
> >        Commands are executed in their order in cmds, and if one of them failed,
> >        the rest after it are not tried.
> > 
> >        Not that this call isn't atomic in terms of updating the epoll
> >        descriptor, which means a second epoll_ctl or epoll_ctl_batch call
> >        during the first epoll_ctl_batch may make the operation sequence
> >        interleaved. However, each single epoll_ctl_cmd operation has the same
> >        semantics as a epoll_ctl call.
> 
> (Good to include that warning!)
> 
> > RETURN VALUE
> > 
> >        If one or more of the parameters are incorrect, -1 is returned and errno
> >        is set appropriately. Otherwise, the number of succeeded commands is
> >        returned.
> > 
> >        Each error_hint field may be set to the error code or 0, depending on
> >        the result of the command. If there is some error in returning the error
> >        to user, it may also be unchanged, even though the command may actually
> >        be executed. In this case, it's still ensured that the i-th (i = ret)
> >        command is the failed command.
> 
> Sorry -- I'm not clear here. Can you say some more here? What sort
> of error might there be when "returning the error to the user"?

As explained above. See also the last comment of Omar:

https://lkml.org/lkml/2015/1/21/103

<...>

> > DESCRIPTION
> > 
> >        The epoll_pwait1 system call differs from epoll_pwait only in parameter
> >        types. The first difference is timeout, a pointer to timespec structure
> >        which allows nanosecond presicion; the second difference, which should
> >        probably be wrapper by glibc and only expose a sigset_t pointer as in
> >        pselect6.
> 
> Here I think it still helps to explain that 'struct sigags' is a sigset_t* +
> the size of the pointed-to set.

Yes, will add.

> 
> >        If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
> 
> The convention I would expect here is that NULL means infinite timeout,
> like select(), and timeout == {0,0} would get the "return immediately"
> behavior. Why did you choose your convention? And, how does one otherwise 
> request an infinite timeout?

I'm wrong. I should follow the conventions of pselect and ppoll. Thanks for
catching that.

> 
> >        (return immediately). Otherwise it's converted to nanosecond scalar,
> >        again, with the same convention as epoll_pwait's timeout.
> > 
> > RETURN VALUE
> > 
> >        The same as said in epoll_pwait(2).
> > 
> > ERRORS
> > 
> >        The same as said in man epoll_pwait(2), plus:
> > 
> >        EINVAL flags is not zero.
> > 
> > 
> > CONFORMING TO
> > 
> >        epoll_pwait1() is Linux-specific.
> > 
> > SEE ALSO
> > 
> >        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
> 
> In your earlier patch set, there was the ability to select the clock
> used for timeouts. Why did this go away? I'm not sure whether we need
> that functionality or not, but it would be good to know why you
> dropped it this time.
> 

No room for another "int clockid". Options:

0) Don't have it.

1) "Or" into lower bits of flags.

2) Encode into flags bits.

3) Squash "int clockid" in the last argument (currently "sig" but need to name
it "spec", again).

Which one do you prefer?

Fam

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-05  1:52   ` Fam Zheng
@ 2015-02-05  7:44     ` Michael Kerrisk (man-pages)
  2015-02-05  9:01       ` Fam Zheng
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-02-05  7:44 UTC (permalink / raw)
  To: Fam Zheng
  Cc: mtk.manpages, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Alexander Viro, Andrew Morton, Kees Cook,
	Andy Lutomirski, David Herrmann, Alexei Starovoitov,
	Miklos Szeredi, David Drysdale, Oleg Nesterov, David S. Miller,
	Vivek Goyal, Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini, Omar Sandoval

Hello Fam Zheng,

On 02/05/2015 02:52 AM, Fam Zheng wrote:
> On Wed, 02/04 13:44, Michael Kerrisk (man-pages) wrote:
>> Hello Fam Zheng,
>>
>> On 02/04/2015 11:36 AM, Fam Zheng wrote:
>>> Changes from v1 (https://lkml.org/lkml/2015/1/20/189):
>>>
>>>   - As discussed in previous thread [1], split the call to epoll_ctl_batch and
>>>     epoll_pwait. [Michael]
>>>
>>>   - Fix memory leaks. [Omar]
>>>
>>>   - Add a short comment about the ignored copy_to_user failure. [Omar]
>>>
>>>   - Cover letter rewritten.
>>>
>>> This adds two new system calls as described below. I mentioned glibc wrapping
>>> of sigarg in epoll_pwait1 description, so don't read it as end user man page
>>> yet.
>>
>> Fair enough. But I think it would be helpful to say a little more already.
>> See my comment below.
>>
>>> One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
>>> which returns per command error code. Ideas to improve that are welcome!
>>>
>>> 1) epoll_ctl_batch
>>> ------------------
>>>
>>> NAME
>>>        epoll_ctl_batch - modify an epoll descriptor in batch
>>>
>>> SYNOPSIS
>>>
>>>        #include <sys/epoll.h>
>>>
>>>        int epoll_ctl_batch(int epfd, int flags,
>>>                            int ncmds, struct epoll_ctl_cmd *cmds);
>>>
>>> DESCRIPTION
>>>
>>>        The system call performs a batch of epoll_ctl operations. It allows
>>>        efficient update of events on this epoll descriptor.
>>>
>>>        Flags is reserved and must be 0.
>>>
>>>        Each operation is specified as an element in the cmds array, defined as:
>>>
>>>            struct epoll_ctl_cmd {
>>>
>>>                   /* Reserved flags for future extension, must be 0. */
>>>                   int flags;
>>>
>>>                   /* The same as epoll_ctl() op parameter. */
>>>                   int op;
>>>
>>>                   /* The same as epoll_ctl() fd parameter. */
>>>                   int fd;
>>>
>>>                   /* The same as the "events" field in struct epoll_event. */
>>>                   uint32_t events;
>>>
>>>                   /* The same as the "data" field in struct epoll_event. */
>>>                   uint64_t data;
>>>
>>>                   /* Output field, will be set to the return code after this
>>>                    * command is executed by kernel */
>>>                   int error_hint;
>>
>> Why 'error_hint', rather than just stay 'status' or 'result'? I mean
>> is it really a "hint"? Also, it can be 0 meaning "success" (no error).
> 
> Because the convention is weak here: the copy_to_user, to assign values to this
> field in user provided array, could (very unlikely) fail, at which point the
> commands are already executed so there is no return. This corner case
> inconsistency makes it hard to be a contract.

I still think it's a poor name. Yes, it could unlikely fail.
Still, 'status' or 'result_status' or something like that is better.

>>>            };
>>>
>>>        Commands are executed in their order in cmds, and if one of them failed,
>>>        the rest after it are not tried.
>>>
>>>        Not that this call isn't atomic in terms of updating the epoll
>>>        descriptor, which means a second epoll_ctl or epoll_ctl_batch call
>>>        during the first epoll_ctl_batch may make the operation sequence
>>>        interleaved. However, each single epoll_ctl_cmd operation has the same
>>>        semantics as a epoll_ctl call.
>>
>> (Good to include that warning!)
>>
>>> RETURN VALUE
>>>
>>>        If one or more of the parameters are incorrect, -1 is returned and errno
>>>        is set appropriately. Otherwise, the number of succeeded commands is
>>>        returned.
>>>
>>>        Each error_hint field may be set to the error code or 0, depending on
>>>        the result of the command. If there is some error in returning the error
>>>        to user, it may also be unchanged, even though the command may actually
>>>        be executed. In this case, it's still ensured that the i-th (i = ret)
>>>        command is the failed command.
>>
>> Sorry -- I'm not clear here. Can you say some more here? What sort
>> of error might there be when "returning the error to the user"?
> 
> As explained above. See also the last comment of Omar:
> 
> https://lkml.org/lkml/2015/1/21/103

Okay -- but could you please actually explain this in more detail in the 
man page. Then others do not need to ask the same question.

> <...>
> 
>>> DESCRIPTION
>>>
>>>        The epoll_pwait1 system call differs from epoll_pwait only in parameter
>>>        types. The first difference is timeout, a pointer to timespec structure
>>>        which allows nanosecond presicion; the second difference, which should
>>>        probably be wrapper by glibc and only expose a sigset_t pointer as in
>>>        pselect6.
>>
>> Here I think it still helps to explain that 'struct sigags' is a sigset_t* +
>> the size of the pointed-to set.
> 
> Yes, will add.
> 
>>
>>>        If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
>>
>> The convention I would expect here is that NULL means infinite timeout,
>> like select(), and timeout == {0,0} would get the "return immediately"
>> behavior. Why did you choose your convention? And, how does one otherwise 
>> request an infinite timeout?
> 
> I'm wrong. I should follow the conventions of pselect and ppoll. Thanks for
> catching that.

Good.

>>
>>>        (return immediately). Otherwise it's converted to nanosecond scalar,
>>>        again, with the same convention as epoll_pwait's timeout.
>>>
>>> RETURN VALUE
>>>
>>>        The same as said in epoll_pwait(2).
>>>
>>> ERRORS
>>>
>>>        The same as said in man epoll_pwait(2), plus:
>>>
>>>        EINVAL flags is not zero.
>>>
>>>
>>> CONFORMING TO
>>>
>>>        epoll_pwait1() is Linux-specific.
>>>
>>> SEE ALSO
>>>
>>>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
>>
>> In your earlier patch set, there was the ability to select the clock
>> used for timeouts. Why did this go away? I'm not sure whether we need
>> that functionality or not, but it would be good to know why you
>> dropped it this time.
>>
> 
> No room for another "int clockid". Options:

Ahh yes. I overlooked that. But again, if your commit message or
the man page said something about why you dropped this argument,
then we would not run around in circles having the same 
conversations repeatedly.

I keep loving this recent quote from akpm:
http://lwn.net/Articles/616226/

> 0) Don't have it.
> 
> 1) "Or" into lower bits of flags.
> 
> 2) Encode into flags bits.
> 
> 3) Squash "int clockid" in the last argument (currently "sig" but need to name
> it "spec", again).

Well, it's up to you. In https://lkml.org/lkml/2015/1/20/189
you initially proposed to have the clockid. Do you need it,
or not? I am agnostic about it. If you do decide you want it,
then I think (3) is the best option.

I'm very pleased that you expand this man page as you go.
But, for the next iteration, I think it would be better to make
it even more complete--it really is helpful to have documentation
as a starting point to discuss the API, and the better the doc,
the better the discussion.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
  2015-02-05  7:44     ` Michael Kerrisk (man-pages)
@ 2015-02-05  9:01       ` Fam Zheng
  0 siblings, 0 replies; 15+ messages in thread
From: Fam Zheng @ 2015-02-05  9:01 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
	David Herrmann, Alexei Starovoitov, Miklos Szeredi,
	David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
	Mike Frysinger, Theodore Ts'o, Heiko Carstens,
	Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
	Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
	Josh Triplett, Paolo Bonzini, Omar Sandoval

On Thu, 02/05 08:44, Michael Kerrisk (man-pages) wrote:
> Hello Fam Zheng,
> 
> On 02/05/2015 02:52 AM, Fam Zheng wrote:
> > On Wed, 02/04 13:44, Michael Kerrisk (man-pages) wrote:
> >> Hello Fam Zheng,
> >>
> >> On 02/04/2015 11:36 AM, Fam Zheng wrote:
> >>> Changes from v1 (https://lkml.org/lkml/2015/1/20/189):
> >>>
> >>>   - As discussed in previous thread [1], split the call to epoll_ctl_batch and
> >>>     epoll_pwait. [Michael]
> >>>
> >>>   - Fix memory leaks. [Omar]
> >>>
> >>>   - Add a short comment about the ignored copy_to_user failure. [Omar]
> >>>
> >>>   - Cover letter rewritten.
> >>>
> >>> This adds two new system calls as described below. I mentioned glibc wrapping
> >>> of sigarg in epoll_pwait1 description, so don't read it as end user man page
> >>> yet.
> >>
> >> Fair enough. But I think it would be helpful to say a little more already.
> >> See my comment below.
> >>
> >>> One caveat is the possible failure of the last copy_to_user in epoll_ctl_batch,
> >>> which returns per command error code. Ideas to improve that are welcome!
> >>>
> >>> 1) epoll_ctl_batch
> >>> ------------------
> >>>
> >>> NAME
> >>>        epoll_ctl_batch - modify an epoll descriptor in batch
> >>>
> >>> SYNOPSIS
> >>>
> >>>        #include <sys/epoll.h>
> >>>
> >>>        int epoll_ctl_batch(int epfd, int flags,
> >>>                            int ncmds, struct epoll_ctl_cmd *cmds);
> >>>
> >>> DESCRIPTION
> >>>
> >>>        The system call performs a batch of epoll_ctl operations. It allows
> >>>        efficient update of events on this epoll descriptor.
> >>>
> >>>        Flags is reserved and must be 0.
> >>>
> >>>        Each operation is specified as an element in the cmds array, defined as:
> >>>
> >>>            struct epoll_ctl_cmd {
> >>>
> >>>                   /* Reserved flags for future extension, must be 0. */
> >>>                   int flags;
> >>>
> >>>                   /* The same as epoll_ctl() op parameter. */
> >>>                   int op;
> >>>
> >>>                   /* The same as epoll_ctl() fd parameter. */
> >>>                   int fd;
> >>>
> >>>                   /* The same as the "events" field in struct epoll_event. */
> >>>                   uint32_t events;
> >>>
> >>>                   /* The same as the "data" field in struct epoll_event. */
> >>>                   uint64_t data;
> >>>
> >>>                   /* Output field, will be set to the return code after this
> >>>                    * command is executed by kernel */
> >>>                   int error_hint;
> >>
> >> Why 'error_hint', rather than just stay 'status' or 'result'? I mean
> >> is it really a "hint"? Also, it can be 0 meaning "success" (no error).
> > 
> > Because the convention is weak here: the copy_to_user, to assign values to this
> > field in user provided array, could (very unlikely) fail, at which point the
> > commands are already executed so there is no return. This corner case
> > inconsistency makes it hard to be a contract.
> 
> I still think it's a poor name. Yes, it could unlikely fail.
> Still, 'status' or 'result_status' or something like that is better.
> 
> >>>            };
> >>>
> >>>        Commands are executed in their order in cmds, and if one of them failed,
> >>>        the rest after it are not tried.
> >>>
> >>>        Not that this call isn't atomic in terms of updating the epoll
> >>>        descriptor, which means a second epoll_ctl or epoll_ctl_batch call
> >>>        during the first epoll_ctl_batch may make the operation sequence
> >>>        interleaved. However, each single epoll_ctl_cmd operation has the same
> >>>        semantics as a epoll_ctl call.
> >>
> >> (Good to include that warning!)
> >>
> >>> RETURN VALUE
> >>>
> >>>        If one or more of the parameters are incorrect, -1 is returned and errno
> >>>        is set appropriately. Otherwise, the number of succeeded commands is
> >>>        returned.
> >>>
> >>>        Each error_hint field may be set to the error code or 0, depending on
> >>>        the result of the command. If there is some error in returning the error
> >>>        to user, it may also be unchanged, even though the command may actually
> >>>        be executed. In this case, it's still ensured that the i-th (i = ret)
> >>>        command is the failed command.
> >>
> >> Sorry -- I'm not clear here. Can you say some more here? What sort
> >> of error might there be when "returning the error to the user"?
> > 
> > As explained above. See also the last comment of Omar:
> > 
> > https://lkml.org/lkml/2015/1/21/103
> 
> Okay -- but could you please actually explain this in more detail in the 
> man page. Then others do not need to ask the same question.

OK, I'll name it 'status' and add more details in next revision.

> 
> > <...>
> > 
> >>> DESCRIPTION
> >>>
> >>>        The epoll_pwait1 system call differs from epoll_pwait only in parameter
> >>>        types. The first difference is timeout, a pointer to timespec structure
> >>>        which allows nanosecond presicion; the second difference, which should
> >>>        probably be wrapper by glibc and only expose a sigset_t pointer as in
> >>>        pselect6.
> >>
> >> Here I think it still helps to explain that 'struct sigags' is a sigset_t* +
> >> the size of the pointed-to set.
> > 
> > Yes, will add.
> > 
> >>
> >>>        If timeout is NULL, it's treated as if 0 is specified in epoll_pwait
> >>
> >> The convention I would expect here is that NULL means infinite timeout,
> >> like select(), and timeout == {0,0} would get the "return immediately"
> >> behavior. Why did you choose your convention? And, how does one otherwise 
> >> request an infinite timeout?
> > 
> > I'm wrong. I should follow the conventions of pselect and ppoll. Thanks for
> > catching that.
> 
> Good.
> 
> >>
> >>>        (return immediately). Otherwise it's converted to nanosecond scalar,
> >>>        again, with the same convention as epoll_pwait's timeout.
> >>>
> >>> RETURN VALUE
> >>>
> >>>        The same as said in epoll_pwait(2).
> >>>
> >>> ERRORS
> >>>
> >>>        The same as said in man epoll_pwait(2), plus:
> >>>
> >>>        EINVAL flags is not zero.
> >>>
> >>>
> >>> CONFORMING TO
> >>>
> >>>        epoll_pwait1() is Linux-specific.
> >>>
> >>> SEE ALSO
> >>>
> >>>        epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
> >>
> >> In your earlier patch set, there was the ability to select the clock
> >> used for timeouts. Why did this go away? I'm not sure whether we need
> >> that functionality or not, but it would be good to know why you
> >> dropped it this time.
> >>
> > 
> > No room for another "int clockid". Options:
> 
> Ahh yes. I overlooked that. But again, if your commit message or
> the man page said something about why you dropped this argument,
> then we would not run around in circles having the same 
> conversations repeatedly.
> 
> I keep loving this recent quote from akpm:
> http://lwn.net/Articles/616226/

Good point!

> 
> > 0) Don't have it.
> > 
> > 1) "Or" into lower bits of flags.
> > 
> > 2) Encode into flags bits.
> > 
> > 3) Squash "int clockid" in the last argument (currently "sig" but need to name
> > it "spec", again).
> 
> Well, it's up to you. In https://lkml.org/lkml/2015/1/20/189
> you initially proposed to have the clockid. Do you need it,
> or not? I am agnostic about it. If you do decide you want it,
> then I think (3) is the best option.

Then let's do it this way. clockid would be useful because different
applications care about different clocks.

> 
> I'm very pleased that you expand this man page as you go.
> But, for the next iteration, I think it would be better to make
> it even more complete--it really is helpful to have documentation
> as a starting point to discuss the API, and the better the doc,
> the better the discussion.
> 

Sure, I'll take your points. Thanks for the advice!

Fam

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-02-05  9:03 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-04 10:36 [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 1/7] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 2/7] epoll: Specify clockid explicitly Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 3/7] epoll: Extract ep_ctl_do Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 4/7] epoll: Add implementation for epoll_ctl_batch Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 5/7] x86: Hook up epoll_ctl_batch syscall Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 6/7] epoll: Add implementation for epoll_pwait1 Fam Zheng
2015-02-04 10:36 ` [PATCH RFC v2 7/7] x86: Hook up epoll_pwait1 syscall Fam Zheng
2015-02-04 10:50 ` [PATCH RFC v2 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-02-04 12:44 ` Michael Kerrisk (man-pages)
2015-02-05  1:52   ` Fam Zheng
2015-02-05  7:44     ` Michael Kerrisk (man-pages)
2015-02-05  9:01       ` Fam Zheng
2015-02-04 21:38 ` Andy Lutomirski
2015-02-05  1:51   ` Fam Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).