All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Fam Zheng <famz@redhat.com>, linux-kernel@vger.kernel.org
Cc: mtk.manpages@gmail.com, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kees Cook <keescook@chromium.org>,
	Andy Lutomirski <luto@amacapital.net>,
	David Herrmann <dh.herrmann@gmail.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	Miklos Szeredi <mszeredi@suse.cz>,
	David Drysdale <drysdale@google.com>,
	Oleg Nesterov <oleg@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Vivek Goyal <vgoyal@redhat.com>,
	Mike Frysinger <vapier@gentoo.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Rashika Kheria <rashika.kheria@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	Josh Triplett <josh@joshtriplett.org>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
Date: Tue, 20 Jan 2015 13:50:37 +0100	[thread overview]
Message-ID: <54BE4F1D.7090807@gmail.com> (raw)
In-Reply-To: <1421747878-30744-6-git-send-email-famz@redhat.com>

Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. 

Which applications? Could we have some specific examples? This is a 
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^ should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single call
    means that even when no wait events are requested (i.e., spec == NULL),
    poll_mod_wait() provides a performance optimization over using multiple
    epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The "cmds' are not executed in a specified order plus the need to
initialize the 'errors' fields to a positive value feels a bit ugly.
And indeed the whole "command list was only partially run" case
is not pretty. Am I correct to understand that if an error is found
during execution of one of the "epoll_ctl" commands in 'cmds' then
the system call will return -1 with errno set, indicating an error,
even though the epoll interest list may have changed because some
of the earlier 'cmds' executed successfully? This all seems a bit of
a headache for user space.

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful,
since it results in fewer systems calls. But, does it really
need to be bound together with the "epoll_pwait" functionality?
(Perhaps this point was covered in previous discussions, but
neither the message accompanying this patch nor the 0/6 man page
provide a compelling rationale for the need to bind these two
operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the 
need for double pointer indirection in the 'spec' argument (i.e., 
spec is a pointer to an array of structures where each element
in turn includes an 'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

epoll_ctl_batch(int epfd, int flags,
                int ncmds, struct epoll_mod_cmd *cmds);

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, 
             const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but 
with greater simplicity. And epoll_ctl_batch() could simply return
the number of 'cmds' that were successfully executed.)

Q2. In the man page in 0/6 you said that the 'cmds' were not 
guaranteed to be executed in order. Why not? If you did provide
such a guarantee, then, when using your current epoll_mod_wait(),
user space could do the following:

1. Initialize the cmd.errors fields to zero.
2. Call epoll_ctl_mod()
3. Iterate through cmd.errors looking for the first nonzero 
   field.

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. 

Yes, that change seemed inevitable. It slightly puzzled me at the time when
Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
though pselect() already had demonstrated the need for higher precision.
I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to write

			ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:
		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to 
applications. In the existing epoll_pwait() call (as in  ppoll()
and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from
a structure that hides this field, then we're going end up exposing
'size_t sigsetsize' to applications. (This could be avoided, if we
split the API as I suggest above. glibc would do the same thing 
in epoll_pwait1() that it currently does in epoll_pwait().)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

WARNING: multiple messages have this Message-ID (diff)
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Fam Zheng <famz@redhat.com>, linux-kernel@vger.kernel.org
Cc: mtk.manpages@gmail.com, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kees Cook <keescook@chromium.org>,
	Andy Lutomirski <luto@amacapital.net>,
	David Herrmann <dh.herrmann@gmail.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	Miklos Szeredi <mszeredi@suse.cz>,
	David Drysdale <drysdale@google.com>,
	Oleg Nesterov <oleg@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Vivek Goyal <vgoyal@redhat.com>,
	Mike Frysinger <vapier@gentoo.org>, Theodore Ts'o <tytso@mit.edu>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Rashika Kheria <rashika.kheria@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Zijlstra <peterz@infrad
Subject: Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
Date: Tue, 20 Jan 2015 13:50:37 +0100	[thread overview]
Message-ID: <54BE4F1D.7090807@gmail.com> (raw)
In-Reply-To: <1421747878-30744-6-git-send-email-famz@redhat.com>

Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. 

Which applications? Could we have some specific examples? This is a 
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^ should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single call
    means that even when no wait events are requested (i.e., spec == NULL),
    poll_mod_wait() provides a performance optimization over using multiple
    epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The "cmds' are not executed in a specified order plus the need to
initialize the 'errors' fields to a positive value feels a bit ugly.
And indeed the whole "command list was only partially run" case
is not pretty. Am I correct to understand that if an error is found
during execution of one of the "epoll_ctl" commands in 'cmds' then
the system call will return -1 with errno set, indicating an error,
even though the epoll interest list may have changed because some
of the earlier 'cmds' executed successfully? This all seems a bit of
a headache for user space.

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful,
since it results in fewer systems calls. But, does it really
need to be bound together with the "epoll_pwait" functionality?
(Perhaps this point was covered in previous discussions, but
neither the message accompanying this patch nor the 0/6 man page
provide a compelling rationale for the need to bind these two
operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the 
need for double pointer indirection in the 'spec' argument (i.e., 
spec is a pointer to an array of structures where each element
in turn includes an 'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

epoll_ctl_batch(int epfd, int flags,
                int ncmds, struct epoll_mod_cmd *cmds);

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, 
             const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but 
with greater simplicity. And epoll_ctl_batch() could simply return
the number of 'cmds' that were successfully executed.)

Q2. In the man page in 0/6 you said that the 'cmds' were not 
guaranteed to be executed in order. Why not? If you did provide
such a guarantee, then, when using your current epoll_mod_wait(),
user space could do the following:

1. Initialize the cmd.errors fields to zero.
2. Call epoll_ctl_mod()
3. Iterate through cmd.errors looking for the first nonzero 
   field.

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. 

Yes, that change seemed inevitable. It slightly puzzled me at the time when
Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
though pselect() already had demonstrated the need for higher precision.
I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to write

			ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:
		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to 
applications. In the existing epoll_pwait() call (as in  ppoll()
and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from
a structure that hides this field, then we're going end up exposing
'size_t sigsetsize' to applications. (This could be avoided, if we
split the API as I suggest above. glibc would do the same thing 
in epoll_pwait1() that it currently does in epoll_pwait().)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

WARNING: multiple messages have this Message-ID (diff)
From: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
To: Fam Zheng <famz@redhat.com>, linux-kernel@vger.kernel.org
Cc: mtk.manpages@gmail.com, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kees Cook <keescook@chromium.org>,
	Andy Lutomirski <luto@amacapital.net>,
	David Herrmann <dh.herrmann@gmail.com>,
	Alexei Starovoitov <ast@plumgrid.com>,
	Miklos Szeredi <mszeredi@suse.cz>,
	David Drysdale <drysdale@google.com>,
	Oleg Nesterov <oleg@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Vivek Goyal <vgoyal@redhat.com>,
	Mike Frysinger <vapier@gentoo.org>, Theodore Ts'o <tytso@mit.edu>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Rashika Kheria <rashika.kheria@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Peter Zijlstra <peterz@infrad>
Subject: Re: [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait
Date: Tue, 20 Jan 2015 13:50:37 +0100	[thread overview]
Message-ID: <54BE4F1D.7090807@gmail.com> (raw)
In-Reply-To: <1421747878-30744-6-git-send-email-famz@redhat.com>

Hello Fam Zheng,

On 01/20/2015 10:57 AM, Fam Zheng wrote:
> This syscall is a sequence of
> 
> 1) a number of epoll_ctl calls
> 2) a epoll_pwait, with timeout enhancement.
> 
> The epoll_ctl operations are embeded so that application doesn't have to use
> separate syscalls to insert/delete/update the fds before poll. It is more
> efficient if the set of fds varies from one poll to another, which is the
> common pattern for certain applications. 

Which applications? Could we have some specific examples? This is a 
complex API, and it needs good justification.

> For example, depending on the input
> buffer status, a data reading program may decide to temporarily not polling an
> fd.
> 
> Because the enablement of batching in this interface, even that regular
> epoll_ctl call sequence, which manipulates several fds, can be optimized to one
> single epoll_ctl_wait (while specifying spec=NULL to skip the poll part).
         ^^^^^^^^^^^^^^ should be epoll_mod_wait

I think you mean to say:

    The ability to batch multiple "epoll_ctl" operations into a single call
    means that even when no wait events are requested (i.e., spec == NULL),
    poll_mod_wait() provides a performance optimization over using multiple
    epoll_ctl() calls.

Right? If yes, please amend the commit message, and this text should
also make its way into the revised man page under a heading "NOTES".

> The only complexity is returning the result of each operation.  For each
> epoll_mod_cmd in cmds, the field "error" is an output field that will be stored
> the return code *iff* the command is executed (0 for success and -errno of the
> equivalent epoll_ctl call), and will be left unchanged if the command is not
> executed because some earlier error, for example due to failure of
> copy_from_user to copy the array.
> 
> Applications can utilize this fact to do error handling: they could initialize
> all the epoll_mod_wait.error to a positive value, which is by definition not a
> possible output value from epoll_mod_wait. Then when the syscall returned, they
> know whether or not the command is executed by comparing each error with the
> init value, if they're different, they have the result of the command.
> More roughly, they can put any non-zero and not distinguish "not run" from
> failure.

The "cmds' are not executed in a specified order plus the need to
initialize the 'errors' fields to a positive value feels a bit ugly.
And indeed the whole "command list was only partially run" case
is not pretty. Am I correct to understand that if an error is found
during execution of one of the "epoll_ctl" commands in 'cmds' then
the system call will return -1 with errno set, indicating an error,
even though the epoll interest list may have changed because some
of the earlier 'cmds' executed successfully? This all seems a bit of
a headache for user space.

I have a couple of questions:

Q1. I can see that batching "epoll_ctl" commands might be useful,
since it results in fewer systems calls. But, does it really
need to be bound together with the "epoll_pwait" functionality?
(Perhaps this point was covered in previous discussions, but
neither the message accompanying this patch nor the 0/6 man page
provide a compelling rationale for the need to bind these two
operations together.)

Yes, I realize you might save a system call, but it makes for a
cumbersome API that has the above headache, and also forces the 
need for double pointer indirection in the 'spec' argument (i.e., 
spec is a pointer to an array of structures where each element
in turn includes an 'events' pointer that points to another array).

Why not a simpler API with two syscalls such as:

epoll_ctl_batch(int epfd, int flags,
                int ncmds, struct epoll_mod_cmd *cmds);

epoll_pwait1(int epfd, struct epoll_event *events, int maxevents, 
             struct timespec *timeout, int clock_id, 
             const sigset_t *sigmask, size_t sigsetsize);

This gives us much of the benefit of reducing system calls, but 
with greater simplicity. And epoll_ctl_batch() could simply return
the number of 'cmds' that were successfully executed.)

Q2. In the man page in 0/6 you said that the 'cmds' were not 
guaranteed to be executed in order. Why not? If you did provide
such a guarantee, then, when using your current epoll_mod_wait(),
user space could do the following:

1. Initialize the cmd.errors fields to zero.
2. Call epoll_ctl_mod()
3. Iterate through cmd.errors looking for the first nonzero 
   field.

> Also, timeout parameter is enhanced: timespec is used, compared to the old ms
> scalar. This provides higher precision. 

Yes, that change seemed inevitable. It slightly puzzled me at the time when
Davide Libenzi added epoll_wait() that the timeout was milliseconds, even
though pselect() already had demonstrated the need for higher precision.
I should have called it out way back then :-{.

> The parameter field in struct
> epoll_wait_spec, "clockid", also makes it possible for users to use a different
> clock than the default when it makes more sense.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  fs/eventpoll.c           | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h |  5 ++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e7a116d..2cc22c9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2067,6 +2067,66 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
>  			      sigmask ? &ksigmask : NULL);
>  }
>  
> +SYSCALL_DEFINE5(epoll_mod_wait, int, epfd, int, flags,
> +		int, ncmds, struct epoll_mod_cmd __user *, cmds,
> +		struct epoll_wait_spec __user *, spec)
> +{
> +	struct epoll_mod_cmd *kcmds = NULL;
> +	int i, ret = 0;
> +	int cmd_size = sizeof(struct epoll_mod_cmd) * ncmds;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (ncmds) {
> +		if (!cmds)
> +			return -EINVAL;
> +		kcmds = kmalloc(cmd_size, GFP_KERNEL);
> +		if (!kcmds)
> +			return -ENOMEM;
> +		if (copy_from_user(kcmds, cmds, cmd_size)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +	for (i = 0; i < ncmds; i++) {
> +		struct epoll_event ev = (struct epoll_event) {
> +			.events = kcmds[i].events,
> +			.data = kcmds[i].data,
> +		};
> +		if (kcmds[i].flags) {
> +			kcmds[i].error = ret = -EINVAL;

To make the 'ret' change a little more obvious, maybe it's better to write

			ret = kcmds[i].error = -EINVAL;

> +			goto out;
> +		}
> +		kcmds[i].error = ret = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

Likewise:
		ret = kcmds[i].error = ep_ctl_do(epfd, kcmds[i].op, kcmds[i].fd, ev);

> +		if (ret)
> +			goto out;
> +	}
> +	if (spec) {
> +		sigset_t ksigmask;
> +		struct epoll_wait_spec kspec;
> +		ktime_t timeout;
> +
> +		if(copy_from_user(&kspec, spec, sizeof(struct epoll_wait_spec)))

Cosmetic point: s/if(/if (/

> +			return -EFAULT;
> +		if (kspec.sigmask) {
> +			if (kspec.sigsetsize != sizeof(sigset_t))
> +				return -EINVAL;
> +			if (copy_from_user(&ksigmask, kspec.sigmask, sizeof(ksigmask)))
> +				return -EFAULT;
> +		}
> +		timeout = timespec_to_ktime(kspec.timeout);
> +		ret = epoll_pwait_do(epfd, kspec.events, kspec.maxevents,
> +				     kspec.clockid, timeout,
> +				     kspec.sigmask ? &ksigmask : NULL);

If I understand correctly, the implementation means that the
'size_t sigsetsize' field will probably need to be exposed to 
applications. In the existing epoll_pwait() call (as in  ppoll()
and pselect()) the 'size_t sigsetsize' argument is hidden by glibc.
However, unless we expect glibc to do some structure copying to/from
a structure that hides this field, then we're going end up exposing
'size_t sigsetsize' to applications. (This could be avoided, if we
split the API as I suggest above. glibc would do the same thing 
in epoll_pwait1() that it currently does in epoll_pwait().)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

  reply	other threads:[~2015-01-20 12:57 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-20  9:57 [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait" Fam Zheng
2015-01-20  9:57 ` Fam Zheng
2015-01-20  9:57 ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 1/6] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 2/6] epoll: Specify clockid explicitly Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 3/6] epoll: Add definition for epoll_mod_wait structures Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 4/6] epoll: Extract ep_ctl_do Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 5/6] epoll: Add implementation for epoll_mod_wait Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20 12:50   ` Michael Kerrisk (man-pages) [this message]
2015-01-20 12:50     ` Michael Kerrisk (man-pages)
2015-01-20 12:50     ` Michael Kerrisk (man-pages)
2015-01-21  4:59     ` Fam Zheng
2015-01-21  4:59       ` Fam Zheng
2015-01-21  4:59       ` Fam Zheng
2015-01-21  7:52       ` Michael Kerrisk (man-pages)
2015-01-21  7:52         ` Michael Kerrisk (man-pages)
2015-01-21  7:52         ` Michael Kerrisk (man-pages)
2015-01-21  8:58         ` Fam Zheng
2015-01-21  8:58           ` Fam Zheng
2015-01-21 10:37           ` Paolo Bonzini
2015-01-21 10:37             ` Paolo Bonzini
2015-01-21 11:14             ` Fam Zheng
2015-01-21 11:14               ` Fam Zheng
2015-01-21 11:14               ` Fam Zheng
2015-01-21 11:50               ` Paolo Bonzini
2015-01-21 11:50                 ` Paolo Bonzini
2015-01-21 11:50                 ` Paolo Bonzini
2015-01-22 21:12                 ` Andy Lutomirski
2015-01-22 21:12                   ` Andy Lutomirski
2015-01-22 21:12                   ` Andy Lutomirski
2015-01-23  6:20                   ` Fam Zheng
2015-01-23  6:20                     ` Fam Zheng
2015-01-23  6:20                     ` Fam Zheng
2015-01-23  9:56                   ` Paolo Bonzini
2015-01-23  9:56                     ` Paolo Bonzini
2015-01-23  9:56                     ` Paolo Bonzini
2015-01-21 10:34         ` Paolo Bonzini
2015-01-21 10:34           ` Paolo Bonzini
2015-01-21 10:34           ` Paolo Bonzini
2015-01-21  7:56   ` Omar Sandoval
2015-01-21  7:56     ` Omar Sandoval
2015-01-21  7:56     ` Omar Sandoval
2015-01-21  8:59     ` Fam Zheng
2015-01-21  8:59       ` Fam Zheng
2015-01-21  8:59       ` Fam Zheng
2015-01-20  9:57 ` [PATCH RFC 6/6] x86: Hook up epoll_mod_wait syscall Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20  9:57   ` Fam Zheng
2015-01-20 10:37 ` [PATCH RFC 0/6] epoll: Introduce new syscall "epoll_mod_wait" Rasmus Villemoes
2015-01-20 10:37   ` Rasmus Villemoes
2015-01-20 10:53   ` Fam Zheng
2015-01-20 10:53     ` Fam Zheng
2015-01-20 12:48 ` Michael Kerrisk (man-pages)
2015-01-20 12:48   ` Michael Kerrisk (man-pages)
2015-01-20 12:48   ` Michael Kerrisk (man-pages)
2015-01-21  9:05   ` Fam Zheng
2015-01-21  9:05     ` Fam Zheng
2015-01-21  9:05     ` Fam Zheng
2015-01-20 22:40 ` Andy Lutomirski
2015-01-20 22:40   ` Andy Lutomirski
2015-01-20 22:40   ` Andy Lutomirski
2015-01-20 23:03   ` josh
2015-01-20 23:03     ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-01-20 23:03     ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-01-21  5:55   ` Michael Kerrisk (man-pages)
2015-01-21  5:55     ` Michael Kerrisk (man-pages)
2015-01-21  5:55     ` Michael Kerrisk (man-pages)
2015-01-21  9:07   ` Fam Zheng
2015-01-21  9:07     ` Fam Zheng
2015-01-21  9:07     ` Fam Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54BE4F1D.7090807@gmail.com \
    --to=mtk.manpages@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=ast@plumgrid.com \
    --cc=davem@davemloft.net \
    --cc=dh.herrmann@gmail.com \
    --cc=drysdale@google.com \
    --cc=famz@redhat.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=hpa@zytor.com \
    --cc=hughd@google.com \
    --cc=josh@joshtriplett.org \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=luto@amacapital.net \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@redhat.com \
    --cc=mszeredi@suse.cz \
    --cc=oleg@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rashika.kheria@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=tytso@mit.edu \
    --cc=vapier@gentoo.org \
    --cc=vgoyal@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.